Is there a significant difference in income between men and women? Does the difference vary depending on other factors discussed below?
To address this problem, we will use the NLSY97 (National Longitudinal Survey of Youth, 1997 cohort) data set.
We renamed the columns of 19 variables that we believe may affect the income gap between men and women.
Below is the list of shortlisted variables and the reasons behind their selection.
After creating the subset, we ran str to inspect it: all selected variables are numeric. The original data has 8984 rows and 95 columns; our renamed_nlsy1 dataset has 5053 rows and 19 columns. The row count drops because we filtered on income and total.incarcerations, keeping only rows with income greater than zero and total.incarcerations greater than or equal to zero. Negative values in the dataset are codes for missing data (-1 refusal, -2 don't know, -3 invalid skip, -4 valid skip, -5 non-interview). Since these codes do not record an income level or a number of incarcerations, we dropped those rows.
Currently, the data type is numeric for all variables. Their category labels are listed in the variable codebook and are applied later, in the data cleaning step.
renamed_nlsy1 <- nlsy %>%
  filter(income > 0 & total.incarcerations >= 0) %>%
  select(
    "total.incarcerations",
    "marijuana",
    "sex",
    "marital.status",
    "high.school.diploma",
    "parenthood.by.20",
    "chance.college.degree",
    "hard.times",
    "work.limitation",
    "citizenship",
    "mother.edu",
    "father.edu",
    "race",
    "hard.drugs",
    "public.private",
    "highest.degree",
    "emp_industry_2011",
    "job.type",
    "income")
## [1] 5053 19
## tibble [5,053 × 19] (S3: tbl_df/tbl/data.frame)
## $ total.incarcerations : num [1:5053] 0 0 0 0 0 0 0 0 0 0 ...
## $ marijuana : num [1:5053] 0 0 1 0 0 0 0 0 0 0 ...
## $ sex : num [1:5053] 1 2 1 1 1 2 1 1 1 1 ...
## $ marital.status : num [1:5053] 0 0 1 1 1 1 0 0 0 0 ...
## $ high.school.diploma : num [1:5053] -4 100 -4 -4 -4 -4 95 -4 -4 -4 ...
## $ parenthood.by.20 : num [1:5053] -4 0 -4 -4 -4 -4 0 -4 -4 -4 ...
## $ chance.college.degree: num [1:5053] -4 100 -4 -4 -4 -4 95 -4 -4 -4 ...
## $ hard.times : num [1:5053] -4 0 0 0 0 0 0 -4 0 0 ...
## $ work.limitation : num [1:5053] -4 1 0 0 0 0 0 -4 0 0 ...
## $ citizenship : num [1:5053] -4 3 3 3 3 1 1 -4 3 1 ...
## $ mother.edu : num [1:5053] 15 12 12 12 12 14 12 6 11 15 ...
## $ father.edu : num [1:5053] 14 -4 12 6 6 -4 12 -4 -4 -4 ...
## $ race : num [1:5053] 2 2 2 4 4 2 2 2 2 1 ...
## $ hard.drugs : num [1:5053] 0 0 0 0 0 0 0 0 0 0 ...
## $ public.private : num [1:5053] -4 -4 -4 2 -5 3 -4 -4 -4 -4 ...
## $ highest.degree : num [1:5053] 2 2 2 5 -5 2 1 0 3 1 ...
## $ emp_industry_2011 : num [1:5053] 9470 8180 9470 7860 -5 8190 4470 8680 -4 -4 ...
## $ job.type : num [1:5053] 3910 4840 3740 230 1000 5860 4850 4110 4650 3500 ...
## $ income : num [1:5053] 116000 45000 125000 59000 75000 36000 63000 35000 20000 55000 ...
We used as.factor to convert the variables from numeric to factor data type.
#handle emp_industry and job type - discuss with everyone
We grouped all negative codes into a single missing category. After looking into each negative value's description, we saw that refusals could convey meaningful information for variables such as marijuana use, but we still grouped them as missing because their count is negligible relative to the other categories of that variable. This gave us a count of missing values for each variable. Moving forward, we decided to drop variables with a large percentage of missing values. We set the threshold at 40%: if missing values make up more than 40% of a variable, we consider it too incomplete to be informative. By this measure, we dropped high.school.diploma, parenthood.by.20, chance.college.degree, and public.private.
## [1] "data.frame"
| variable | missing |
|---|---|
| total.incarcerations | 0 |
| marijuana | 19 |
| sex | 0 |
| marital.status | 30 |
| high.school.diploma | 3116 |
| parenthood.by.20 | 3130 |
| chance.college.degree | 3119 |
| hard.times | 560 |
| work.limitation | 557 |
| citizenship | 532 |
| mother.edu | 349 |
| father.edu | 1725 |
| race | 0 |
| hard.drugs | 235 |
| public.private | 4033 |
| highest.degree | 308 |
| emp_industry_2011 | 749 |
| job.type | 125 |
| income | 0 |
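The 40% rule can be checked directly from the counts in the table above. The following is a minimal sketch, not the code used in the report: the object names (n.rows, missing.counts, drop.vars) are our own, and only a few of the 19 variables are copied in for brevity.

```r
# Missing-value counts copied from the table above; 5053 is the row count
# of renamed_nlsy1. Object names here are illustrative assumptions.
n.rows <- 5053
missing.counts <- c(
  high.school.diploma   = 3116,
  parenthood.by.20      = 3130,
  chance.college.degree = 3119,
  father.edu            = 1725,
  public.private        = 4033,
  emp_industry_2011     = 749
)

# Share of missing values per variable, in percent
missing.share <- missing.counts / n.rows
print(round(100 * missing.share, 1))

# Variables whose missing share exceeds the 40% threshold
drop.vars <- names(missing.share)[missing.share > 0.40]
print(drop.vars)
```

Under this rule father.edu (about 34% missing) stays in the data, while the four variables named in the text are dropped.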
## 'data.frame': 5053 obs. of 19 variables:
## $ total.incarcerations : num 0 0 0 0 0 0 0 0 0 0 ...
## $ marijuana : Factor w/ 3 levels "no","yes","missing": 1 1 2 1 1 1 1 1 1 1 ...
## $ sex : Factor w/ 2 levels "Male","Female": 1 2 1 1 1 2 1 1 1 1 ...
## $ marital.status : Factor w/ 6 levels "Divorced","Married",..: 4 4 2 2 2 2 4 4 4 4 ...
## $ high.school.diploma : Factor w/ 12 levels "0%","1%-10%",..: 12 11 12 12 12 12 11 12 12 12 ...
## $ parenthood.by.20 : Factor w/ 12 levels "0%","1%-10%",..: 12 1 12 12 12 12 1 12 12 12 ...
## $ chance.college.degree: Factor w/ 12 levels "0%","1%-10%",..: 12 11 12 12 12 12 11 12 12 12 ...
## $ hard.times : Factor w/ 3 levels "missing","no",..: 1 2 2 2 2 2 2 1 2 2 ...
## $ work.limitation : Factor w/ 3 levels "missing","no",..: 1 3 2 2 2 2 2 1 2 2 ...
## $ citizenship : Factor w/ 4 levels "missing","unknown.birthplace",..: 1 2 2 2 2 4 4 1 2 4 ...
## $ mother.edu : Factor w/ 22 levels "missing","1ST GRADE",..: 16 13 13 13 13 15 13 7 12 16 ...
## $ father.edu : Factor w/ 22 levels "missing","1ST GRADE",..: 15 1 13 7 7 1 13 1 1 1 ...
## $ race : Factor w/ 4 levels "Black","Hispanic",..: 2 2 2 4 4 2 2 2 2 1 ...
## $ hard.drugs : Factor w/ 3 levels "missing","no",..: 2 2 2 2 2 2 2 2 2 2 ...
## $ public.private : Factor w/ 4 levels "missing","Private for-profit institution",..: 1 1 1 3 1 2 1 1 1 1 ...
## $ highest.degree : Factor w/ 9 levels "Associate/Junior college (AA)",..: 4 4 4 5 6 4 3 7 1 3 ...
## $ emp_industry_2011 : Factor w/ 18 levels "ACS SPECIAL CODES",..: 14 5 14 5 11 5 18 6 11 11 ...
## $ job.type : Factor w/ 33 levels "ACS SPECIAL CODES",..: 28 29 28 10 20 24 29 13 25 14 ...
## $ income : num 116000 45000 125000 59000 75000 36000 63000 35000 20000 55000 ...
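As an illustration of the recoding step, here is a small sketch on a hypothetical vector. The coding 0 = no, 1 = yes and the name marijuana.toy are assumptions made for the example; the actual labels come from the variable codebook.

```r
# Hypothetical raw codes for a yes/no item:
# 0 = no, 1 = yes, any negative value = a non-response code
raw <- c(0, 1, -1, 0, -4, 1, -2)

# Collapse every negative code into a single "missing" label
recoded <- ifelse(raw < 0, "missing", ifelse(raw == 1, "yes", "no"))

# Convert to a factor with an explicit level order
marijuana.toy <- factor(recoded, levels = c("no", "yes", "missing"))
print(table(marijuana.toy))
```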
Using summary, we see that median income is 40,000 while mean income is 49,781. The lower and upper quartiles are 25,000 and 62,000 respectively. Since the mean exceeds the median, the distribution is right-skewed, which could be explained by the topcoded values. The density plots for males and females support this. They also show that female density is greater than male density at lower income levels. Using table for sex, we observe 2600 males and 2453 females, so the sample is split roughly equally between the two genders. The histograms of income within each gender show that the income distribution is not normal for either males or females. We also see outliers, which we believe are due to the topcoded income values. Moreover, on a closer look at the graph, men's incomes cluster around a higher value than women's.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2 25000 40000 49781 62000 235884
##
## Black Hispanic Mixed Race (Non-Hispanic)
## 1242 1048 47
## Non-Black / Non-Hispanic
## 2716
121 of the incomes are topcoded to the maximum value of $235,884, which is the average income of those top 121 earners. We confirmed this using the slice_max function, which pulled the top 2% of rows from our renamed_nlsy1 dataset based on income. To understand the topcoded values and their implications for our analysis, we first drew a histogram of income for each gender. We also constructed a QQ-plot to check whether the data is normal. As expected, the distribution is right-skewed because of the outliers, so the data do not appear normal. We therefore removed the topcoded values so that they do not distort our analysis. This produced the renamed_nlsy dataset, and its QQ-plot shows a relative improvement in the normality of the income distribution.
## [1] 4933
From the box plot, we observe that median income is higher for males than for females. To check whether the difference in mean income between the sexes is statistically significant, we run a t.test and check the p-value. As the p-value is less than 0.05, we reject the null hypothesis of equal means and conclude that average income differs by sex. This is consistent with the premise of our analysis that sex is related to income.
## `summarise()` ungrouping output (override with `.groups` argument)
| sex | num.obs | mean.income | sd.income | se.income |
|---|---|---|---|---|
| Male | 2510 | 51137 | 29506 | 589 |
| Female | 2423 | 39159 | 26385 | 536 |
##
## Welch Two Sample t-test
##
## data: income by sex
## t = 15.041, df = 4902.5, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 10416.46 13538.83
## sample estimates:
## mean in group Male mean in group Female
## 51136.71 39159.06
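The Welch statistic and confidence interval above can be reproduced approximately from the summary table alone. This is a sketch under two assumptions: the standard errors are the rounded values from the table, and the degrees of freedom are copied from the t.test output rather than recomputed, so the results match only up to rounding.

```r
# Group means from the t.test output, standard errors from the summary table
mean.male   <- 51136.71
mean.female <- 39159.06
se.male     <- 589
se.female   <- 536

# Welch test: the standard error of the difference combines both group SEs
diff.means <- mean.male - mean.female
se.diff    <- sqrt(se.male^2 + se.female^2)
t.stat     <- diff.means / se.diff

# Degrees of freedom copied from the output above (not recomputed here)
df.welch <- 4902.5
ci <- diff.means + c(-1, 1) * qt(0.975, df.welch) * se.diff

print(round(t.stat, 2))
print(round(ci, 0))
```

The interval lies well above zero, which is what the reported p-value expresses: a mean difference of zero is not compatible with the data at the 5% level.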
From the box plot, we can observe that the median income is the same whether or not marijuana was consumed. To check whether the income gap between men and women is statistically significant within marijuana users and non-users, we run a t.test for each group and visualize the results as a bar graph with confidence interval bars. The bar graph shows three things: i) The income gap between men and women is statistically significant for both marijuana users and non-users, since the error bars do not contain the null hypothesis value of 0. ii) The positive bars are evidence that men earn more than women in both the user and non-user groups. iii) Since the error bar for users contains the height of the bar for non-users, the income gap for users is not statistically different from that for non-users. Finally, the missing category has a wide confidence interval, which reflects its low count (18 observations in the table below), and the income gap in that category is not statistically significant.
| marijuana | num.obs | mean.income | sd.income | se.income |
|---|---|---|---|---|
| no | 3949 | 45425 | 28659 | 456 |
| yes | 966 | 44652 | 28745 | 925 |
| missing | 18 | 39961 | 19254 | 4538 |
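The per-group gap and its error bar can be computed with one t.test per group. The sketch below uses a small made-up data frame (toy), not the NLSY data, and the helper name gap.by.group is our own; it only illustrates the shape of the computation behind the bar graphs.

```r
# Hypothetical toy data: 4 men and 4 women in each marijuana group
toy <- data.frame(
  income = c(52, 48, 55, 50, 40, 38, 42, 41,
             47, 53, 49, 51, 39, 37, 43, 40) * 1000,
  sex = factor(rep(rep(c("Male", "Female"), each = 4), 2),
               levels = c("Male", "Female")),
  marijuana = factor(rep(c("no", "yes"), each = 8))
)

# For each group: Male minus Female mean income with a 95% Welch interval
gap.by.group <- do.call(rbind, lapply(split(toy, toy$marijuana), function(d) {
  tt <- t.test(income ~ sex, data = d)
  data.frame(
    group = as.character(d$marijuana[1]),
    gap   = unname(tt$estimate[1] - tt$estimate[2]),
    lower = tt$conf.int[1],
    upper = tt$conf.int[2]
  )
}))
print(gap.by.group)
```

A gap is read as significant when its interval excludes 0, and two groups are read as not distinguishable when one interval covers the other group's gap. The second reading is an informal visual check rather than a formal test of the difference between gaps.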
Contrary to our expectation that divorced individuals would work harder to make ends meet, the box plot shows that median income is highest for married individuals. One possible explanation is that stable household conditions go along with doing well both at home and at work. To check whether the income gap between men and women is statistically significant within the different marital.status groups, we run a t.test for each group and visualize the results as a bar graph with confidence interval bars. The bar graph shows three things: i) The income gap between men and women is statistically insignificant only for widowed and separated individuals, since their error bars contain the null hypothesis value of 0. ii) Only the separated group has a negative bar, indicating that women earn more than men in that group. iii) No error bar overlaps the height of another bar, which suggests that the income gap differs across marital status groups. Finally, the missing category has a low count, so we do not include it in our analysis.
| marital.status | num.obs | mean.income | sd.income | se.income |
|---|---|---|---|---|
| Divorced | 506 | 40632 | 26422 | 1175 |
| Married | 2386 | 51038 | 29830 | 611 |
| missing | 30 | 37857 | 21206 | 3872 |
| Never married | 1890 | 39793 | 26412 | 608 |
| Separated | 104 | 38600 | 27296 | 2677 |
| Widowed | 17 | 31706 | 23105 | 5604 |
We used a density plot to visualize the income distribution for each level of educational attainment. Each distribution has a long right tail, indicating that income is skewed within each degree level. We also ran an ANOVA test to check whether the highest degree obtained is related to income; the p-value (<2e-16) indicates that average income varies across degrees obtained.
## Df Sum Sq Mean Sq F value Pr(>F)
## highest.degree 8 6.027e+11 7.533e+10 107.7 <2e-16 ***
## Residuals 4924 3.445e+12 6.996e+08
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
From the box plot, we can observe that median income is higher for individuals who have not experienced hard times. To check whether the income gap between men and women is statistically significant among individuals who have experienced hard.times and among those who have not, we run a t-test for each group and visualize the results as a bar graph with confidence interval bars. The bar graph shows three things: i) The income gap between men and women is statistically significant both for those who have experienced hard times and for those who have not, since the error bars do not contain the null hypothesis value of 0. ii) The positive bars are evidence that men earn more than women in both groups. iii) Since the error bar of the yes group contains the height of the bar of the no group, the income gap for those who experienced hard times is not statistically different from the gap for those who did not.
| hard.times | num.obs | mean.income | sd.income | se.income |
|---|---|---|---|---|
| missing | 550 | 44311 | 29457 | 1256 |
| no | 4155 | 45877 | 28779 | 446 |
| yes | 228 | 36160 | 21953 | 1454 |
From the box plot, we can observe that median income is higher for individuals who do not have any work limitations. To check whether the income gap between men and women is statistically significant among individuals with and without work limitations, we run a t-test for each group and visualize the results as a bar graph with confidence interval bars. The bar graph shows three things: i) The income gap between men and women is statistically significant both for those who have work limitations and for those who do not, since the error bars do not contain the null hypothesis value of 0. ii) The positive bars are evidence that men earn more than women in both groups. iii) Since the error bar of the yes group contains the height of the bar of the no group and vice versa, the income gap for those who have work limitations is not statistically different from the gap for those who do not.
| work.limitation | num.obs | mean.income | sd.income | se.income |
|---|---|---|---|---|
| missing | 547 | 44379 | 29524 | 1262 |
| no | 4139 | 45909 | 28631 | 445 |
| yes | 247 | 36212 | 25254 | 1607 |
From the box plot, we can observe that median income is highest for the 1ST GRADE category. Because 1ST GRADE and UNGRADED have only 4 observations each, we believe a t-test would not be reliable for these groups. The box plot for 1ST GRADE is also widely spread for the same reason. Hence, we dropped 1ST GRADE and UNGRADED from our statistical significance test. To check whether the income gap between men and women is statistically significant within each level of mother's education, we run a t-test for each level and visualize the results as a bar graph with confidence interval bars. The bar graph shows three things: i) The income gap between men and women is statistically significant at every level of mother's education except 4TH GRADE, since the error bars do not contain the null hypothesis value of 0. ii) The positive bars (excluding 4TH GRADE) are evidence that men earn more than women across levels of mother's education. iii) Since the error bar of the 2ND GRADE group contains the heights of the bars of all other groups, the income gap is not statistically different across levels of mother's education.
| mother.edu | num.obs | mean.income | sd.income | se.income |
|---|---|---|---|---|
| missing | 342 | 39556 | 26393 | 1427 |
| 1ST GRADE | 4 | 69250 | 50022 | 25011 |
| 2ND GRADE | 15 | 35667 | 13937 | 3599 |
| 3RD GRADE | 27 | 48330 | 31144 | 5994 |
| 4TH GRADE | 32 | 38385 | 25667 | 4537 |
| 5TH GRADE | 21 | 36810 | 22164 | 4837 |
| 6TH GRADE | 123 | 40285 | 24965 | 2251 |
| 7TH GRADE | 35 | 34357 | 20013 | 3383 |
| 8TH GRADE | 97 | 35498 | 23955 | 2432 |
| 9TH GRADE | 135 | 38437 | 25801 | 2221 |
| 10TH GRADE | 212 | 36890 | 24699 | 1696 |
| 11TH GRADE | 281 | 39335 | 24870 | 1484 |
| 12TH GRADE | 1647 | 44024 | 27395 | 675 |
| 1ST YEAR COLLEGE | 378 | 47374 | 29625 | 1524 |
| 2ND YEAR COLLEGE | 579 | 45789 | 28683 | 1192 |
| 3RD YEAR COLLEGE | 150 | 45774 | 28735 | 2346 |
| 4TH YEAR COLLEGE | 548 | 54867 | 31410 | 1342 |
| 5TH YEAR COLLEGE | 106 | 60845 | 32423 | 3149 |
| 6TH YEAR COLLEGE | 132 | 57160 | 32803 | 2855 |
| 7TH YEAR COLLEGE | 19 | 53837 | 28887 | 6627 |
| 8TH YEAR COLLEGE | 46 | 64443 | 33716 | 4971 |
| UNGRADED | 4 | 21000 | 14071 | 7036 |
Contrary to our beliefs, we see that median income is higher for individuals with an unknown birthplace than for the rest. To check whether the income gap between men and women is statistically significant within each citizenship group, we run a t-test for each group and visualize the results as a bar graph with confidence interval bars. The bar graph shows three things: i) The income gap between men and women is statistically significant in every citizenship group, since the error bars do not contain the null hypothesis value of 0. ii) The positive bars are evidence that men earn more than women across citizenship groups. iii) Since the error bar of any one group contains the heights of the bars of the other groups, the income gap is not statistically different across citizenship groups.
| citizenship | num.obs | mean.income | sd.income | se.income |
|---|---|---|---|---|
| missing | 522 | 44887 | 29779 | 1303 |
| unknown.birthplace | 407 | 47843 | 29062 | 1441 |
| Unknown.not.us.born | 128 | 48306 | 29488 | 2606 |
| US.born.citizen | 3876 | 44930 | 28410 | 456 |
We were interested in the correlation between income and total incarcerations within each gender. The summary table shows that this correlation is negative for both men and women. After plotting the fitted linear regression line for each gender, we found that for the same number of incarcerations, men on average have a higher income than women. Moreover, the line for females shows that the highest number of incarcerations among women is 6, whereas for men it goes beyond 10.
| sex | cor_inc_incarc |
|---|---|
| Male | -0.1844140 |
| Female | -0.0932244 |
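A within-group correlation of this kind can be computed in base R by splitting the data by sex. The sketch below runs on a small hypothetical data frame (toy), not on renamed_nlsy, and the name cor.by.sex is our own.

```r
# Hypothetical toy data: income tends to fall as incarcerations rise
toy <- data.frame(
  sex = rep(c("Male", "Female"), each = 5),
  total.incarcerations = c(0, 1, 2, 3, 4, 0, 0, 1, 1, 2),
  income = c(60, 52, 50, 41, 35, 45, 43, 40, 41, 36) * 1000
)

# Pearson correlation between income and incarcerations within each sex
cor.by.sex <- sapply(split(toy, toy$sex), function(d) {
  cor(d$income, d$total.incarcerations)
})
print(round(cor.by.sex, 3))
```

A negative value in each group means that, within that group, more incarcerations go together with lower income. It says nothing about the size of the income difference between the groups.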
## `geom_smooth()` using formula 'y ~ x'
#### 10. hard.drugs
As part of our exploratory analysis, we compared the average income of men and women within users and non-users of hard drugs. This showed that usage of hard drugs is associated with lower average income for both genders. However, within both groups, users and non-users of hard drugs, men have a higher average income than women. Interestingly, as seen in the summary table, the difference is larger for women: average income is $3,017 lower among female users, compared with only $1,045 lower among male users. Moreover, as seen in the bar graph, within each group of users and non-users of hard drugs, the average income is statistically different between men and women.
We wanted to explore how the income gap between men and women varies within each race for groups who have experienced hard times and those who have not. The interpretation is as follows: i) For the Hispanic and Non-Black/Non-Hispanic races, the income gap between men and women is statistically significant for both groups: those who have experienced hard times and those who have not. However, the difference in the income gap between the two groups is not statistically significant. ii) For the Black race, the income gap between men and women is statistically significant only for individuals who have not experienced hard times. Interestingly, among individuals who have experienced hard times, women earn more than men in this race, but this difference is not statistically significant. iii) For the Mixed Race, none of the respondents reported experiencing hard times, whereas among those who did not, the difference in income between men and women is not statistically significant.
To explore the industry type variable, we looked at average income and how it differs between males and females within each industry type. Although no statistical tests were performed on this data, our exploratory analysis suggests that males earn more on average than females in all industry types except ACS SPECIAL CODES. This gives us an overall sense that the income gap persists between the two genders regardless of the industry's impact on income.
## `summarise()` regrouping output by 'sex' (override with `.groups` argument)
From the box plot, we can observe that median income is highest for the 7TH YEAR COLLEGE category. Given the minimal number of observations for 1ST GRADE (1, per the table below), we believe a t-test would not have been accurate. For 1ST GRADE, we do not see a box plot as there is only one observation. For UNGRADED, the box plot is condensed despite the small number of observations, which we believe could just be a coincidence. To check whether the income gap between men and women is statistically significant within each level of father's education, we run a t-test for each level and visualize the results as a bar graph with confidence interval bars. The bar graph shows three things: i) The income gap between men and women is statistically significant, since the error bars do not contain the null hypothesis value of 0, at all levels of father's education except 3RD, 4TH, 5TH, and 6TH GRADE and 4TH YEAR COLLEGE. ii) The positive bars are evidence that men earn more than women across levels of father's education. iii) Since the error bars of the different groups contain the heights of the bars of the other groups, the income gap is not statistically different across levels of father's education.
| father.edu | num.obs | mean.income | sd.income | se.income |
|---|---|---|---|---|
| missing | 1704 | 39785 | 26139 | 633 |
| 1ST GRADE | 1 | 52000 | NA | NA |
| 2ND GRADE | 6 | 31000 | 17481 | 7137 |
| 3RD GRADE | 27 | 36519 | 30294 | 5830 |
| 4TH GRADE | 23 | 38042 | 22378 | 4666 |
| 5TH GRADE | 26 | 35351 | 18368 | 3602 |
| 6TH GRADE | 91 | 43074 | 22499 | 2359 |
| 7TH GRADE | 41 | 41768 | 26433 | 4128 |
| 8TH GRADE | 53 | 41369 | 28669 | 3938 |
| 9TH GRADE | 105 | 41050 | 29188 | 2848 |
| 10TH GRADE | 108 | 41668 | 26810 | 2580 |
| 11TH GRADE | 149 | 38365 | 25833 | 2116 |
| 12TH GRADE | 1089 | 46231 | 27561 | 835 |
| 1ST YEAR COLLEGE | 196 | 47442 | 29610 | 2115 |
| 2ND YEAR COLLEGE | 401 | 50625 | 30393 | 1518 |
| 3RD YEAR COLLEGE | 102 | 51488 | 29575 | 2928 |
| 4TH YEAR COLLEGE | 450 | 53596 | 31173 | 1470 |
| 5TH YEAR COLLEGE | 73 | 53986 | 29263 | 3425 |
| 6TH YEAR COLLEGE | 147 | 57313 | 34087 | 2811 |
| 7TH YEAR COLLEGE | 56 | 65700 | 34145 | 4563 |
| 8TH YEAR COLLEGE | 82 | 56027 | 33774 | 3730 |
| UNGRADED | 3 | 39333 | 5508 | 3180 |
##
## Call:
## lm(formula = income ~ sex, data = renamed_nlsy)
##
## Residuals:
## Min 1Q Median 3Q Max
## -51135 -20137 -4659 13863 105841
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 51136.7 559.2 91.44 <2e-16 ***
## sexFemale -11977.6 797.9 -15.01 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 28020 on 4931 degrees of freedom
## Multiple R-squared: 0.0437, Adjusted R-squared: 0.04351
## F-statistic: 225.3 on 1 and 4931 DF, p-value: < 2.2e-16
The confidence level is set at 95%, so alpha = 0.05 is the threshold against which p-values are compared to reject the null hypothesis.
First, we ran a basic linear regression of income on sex, ignoring all other variables. Male is the reference level, so it is absorbed into the intercept: the average male income is 51136.7, while the average female income is 51136.7 + (1)(-11977.6) = 39159.1. Both the intercept and the sexFemale coefficient are statistically significant (p < 2e-16). The R^2 value is 0.0437, indicating that sex alone explains little of the variation in income. Therefore, there is a need to add more variables from the dataset.
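To make the reading of the coefficients concrete, the sketch below first redoes the arithmetic with the reported estimates and then fits the same kind of model on a tiny hypothetical data frame (toy), where the group means are easy to verify by hand. All object names are assumptions made for the example.

```r
# Reported estimates from the lm output above
intercept   <- 51136.7
coef.female <- -11977.6

# With Male as the reference level, sexFemale is a 0/1 indicator
mean.male   <- intercept + 0 * coef.female
mean.female <- intercept + 1 * coef.female
print(c(male = mean.male, female = mean.female))

# Same structure on toy data: male mean 50000, female mean 38000
toy <- data.frame(
  income = c(50, 54, 46, 40, 38, 36) * 1000,
  sex = factor(rep(c("Male", "Female"), each = 3),
               levels = c("Male", "Female"))
)
fit <- lm(income ~ sex, data = toy)
print(coef(fit))
```

With a single two-level factor as predictor, the intercept is the mean of the reference group and the slope is the difference in group means, which is why this regression agrees with the earlier comparison of mean incomes by sex.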
## Analysis of Variance Table
##
## Model 1: income ~ sex
## Model 2: income ~ sex + race
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 4931 3.8704e+12
## 2 4928 3.7318e+12 3 1.3863e+11 61.025 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Call:
## lm(formula = income ~ sex + race, data = renamed_nlsy)
##
## Residuals:
## Min 1Q Median 3Q Max
## -55177 -19327 -4233 14673 108767
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 42641.5 891.1 47.852 < 2e-16 ***
## sexFemale -11408.1 784.9 -14.535 < 2e-16 ***
## raceHispanic 6480.3 1160.8 5.583 2.5e-08 ***
## raceMixed Race (Non-Hispanic) 11993.9 4090.4 2.932 0.00338 **
## raceNon-Black / Non-Hispanic 12685.7 953.1 13.310 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 27520 on 4928 degrees of freedom
## Multiple R-squared: 0.07796, Adjusted R-squared: 0.07721
## F-statistic: 104.2 on 4 and 4928 DF, p-value: < 2.2e-16
The second linear regression adds race to sex as a predictor of income. First, we run an anova test comparing the lm models with and without race. The p-value from this test is below 2.2e-16, indicating high statistical significance. Therefore, race should be included in the analysis. The adjusted R-squared is 0.07721, showing again that the second model explains only a small share of the variability in income.
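The F statistic in the anova table for the models with and without race can be recomputed from the residual sums of squares it reports. This is a sketch using the rounded values printed in that table, so the result agrees with the reported F only up to rounding; the object names are our own.

```r
# Values copied from the anova table (Model 1: sex, Model 2: sex + race)
rss.reduced   <- 3.8704e12   # RSS of income ~ sex
rss.full      <- 3.7318e12   # RSS of income ~ sex + race
df.diff       <- 3           # extra parameters added by race
df.resid.full <- 4928        # residual degrees of freedom of the full model

# F = (drop in RSS per added parameter) / (residual variance of full model)
f.stat  <- ((rss.reduced - rss.full) / df.diff) / (rss.full / df.resid.full)
p.value <- pf(f.stat, df.diff, df.resid.full, lower.tail = FALSE)

print(round(f.stat, 2))
print(p.value < 2.2e-16)
```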
lm.add.race.interact <- lm(income ~ sex + race + sex*race, data = renamed_nlsy)
anova(lm.add.race, lm.add.race.interact)
## Analysis of Variance Table
##
## Model 1: income ~ sex + race
## Model 2: income ~ sex + race + sex * race
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 4928 3.7318e+12
## 2 4925 3.7174e+12 3 1.4341e+10 6.3332 0.000278 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Call:
## lm(formula = income ~ sex + race + sex * race, data = renamed_nlsy)
##
## Residuals:
## Min 1Q Median 3Q Max
## -56052 -19202 -4458 13798 106018
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 39458 1152 34.259 < 2e-16
## sexFemale -5475 1572 -3.483 0.000501
## raceHispanic 10719 1659 6.460 1.15e-10
## raceMixed Race (Non-Hispanic) 16782 5614 2.989 0.002810
## raceNon-Black / Non-Hispanic 16744 1368 12.242 < 2e-16
## sexFemale:raceHispanic -8085 2320 -3.485 0.000496
## sexFemale:raceMixed Race (Non-Hispanic) -9361 8184 -1.144 0.252759
## sexFemale:raceNon-Black / Non-Hispanic -7791 1905 -4.090 4.38e-05
##
## (Intercept) ***
## sexFemale ***
## raceHispanic ***
## raceMixed Race (Non-Hispanic) **
## raceNon-Black / Non-Hispanic ***
## sexFemale:raceHispanic ***
## sexFemale:raceMixed Race (Non-Hispanic)
## sexFemale:raceNon-Black / Non-Hispanic ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 27470 on 4925 degrees of freedom
## Multiple R-squared: 0.0815, Adjusted R-squared: 0.08019
## F-statistic: 62.43 on 7 and 4925 DF, p-value: < 2.2e-16
This model follows up on the second model by adding the interaction term between sex and race. Earlier in the paper (see: interaction terms) we discussed the need for an interaction term between sex and race. The anova test confirms this need, with a statistically significant p-value of 0.000278.
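With the interaction in the model, the female-minus-male income gap is no longer a single number: for each race it is the sexFemale coefficient plus that race's interaction coefficient. The sketch below does this arithmetic with the estimates reported in the interaction model output; Black is the reference race, so its interaction term is 0. The object names are our own.

```r
# Estimates copied from the income ~ sex * race output above
b.female <- -5475
b.interaction <- c(
  "Black"                     = 0,      # reference level
  "Hispanic"                  = -8085,
  "Mixed Race (Non-Hispanic)" = -9361,  # not significant (p = 0.25)
  "Non-Black / Non-Hispanic"  = -7791
)

# Estimated female minus male income difference within each race
gap.by.race <- b.female + b.interaction
print(gap.by.race)
```

The estimated gap is smallest in the reference group and more than twice as large in the Hispanic and Non-Black / Non-Hispanic groups. The Mixed Race estimate rests on few observations and its interaction term is not statistically significant, so it should be read with caution.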
lm.add.tot.incarc <- lm(income ~ sex + race + sex*race + total.incarcerations, data = renamed_nlsy)
anova(lm.add.race.interact, lm.add.tot.incarc)
## Analysis of Variance Table
##
## Model 1: income ~ sex + race + sex * race
## Model 2: income ~ sex + race + sex * race + total.incarcerations
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 4925 3.7174e+12
## 2 4924 3.6393e+12 1 7.8112e+10 105.69 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Call:
## lm(formula = income ~ sex + race + sex * race + total.incarcerations,
## data = renamed_nlsy)
##
## Residuals:
## Min 1Q Median 3Q Max
## -57422 -18974 -4184 13395 105816
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 41681.0 1160.0 35.930 < 2e-16
## sexFemale -7497.2 1568.2 -4.781 1.80e-06
## raceHispanic 10107.9 1643.1 6.152 8.26e-10
## raceMixed Race (Non-Hispanic) 15956.8 5556.0 2.872 0.004096
## raceNon-Black / Non-Hispanic 15891.5 1356.0 11.720 < 2e-16
## total.incarcerations -6989.1 679.8 -10.280 < 2e-16
## sexFemale:raceHispanic -7317.8 2296.8 -3.186 0.001451
## sexFemale:raceMixed Race (Non-Hispanic) -8419.0 8098.6 -1.040 0.298595
## sexFemale:raceNon-Black / Non-Hispanic -6800.0 1887.3 -3.603 0.000318
##
## (Intercept) ***
## sexFemale ***
## raceHispanic ***
## raceMixed Race (Non-Hispanic) **
## raceNon-Black / Non-Hispanic ***
## total.incarcerations ***
## sexFemale:raceHispanic **
## sexFemale:raceMixed Race (Non-Hispanic)
## sexFemale:raceNon-Black / Non-Hispanic ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 27190 on 4924 degrees of freedom
## Multiple R-squared: 0.1008, Adjusted R-squared: 0.09934
## F-statistic: 69 on 8 and 4924 DF, p-value: < 2.2e-16
In this regression model we added the total.incarcerations variable. An anova test was conducted to check its statistical significance. The p-value is below 2.2e-16, suggesting high statistical significance. The adjusted R-squared with total.incarcerations in the model is 0.0993, up from 0.0802 for the previous model. This increase suggests that, with the addition of total.incarcerations, the model explains more of the variability in income.
lm.add.marital.status <- lm(income ~ sex + race + sex*race + total.incarcerations + marital.status, data = renamed_nlsy)
anova(lm.add.tot.incarc, lm.add.marital.status)
## Analysis of Variance Table
##
## Model 1: income ~ sex + race + sex * race + total.incarcerations
## Model 2: income ~ sex + race + sex * race + total.incarcerations + marital.status
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 4924 3.6393e+12
## 2 4919 3.5511e+12 5 8.8208e+10 24.437 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Call:
## lm(formula = income ~ sex + race + sex * race + total.incarcerations +
## marital.status, data = renamed_nlsy)
##
## Residuals:
## Min 1Q Median 3Q Max
## -61057 -19015 -4365 13576 108292
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 39356.3 1644.0 23.940 < 2e-16
## sexFemale -6691.8 1551.7 -4.313 1.64e-05
## raceHispanic 9132.3 1627.3 5.612 2.11e-08
## raceMixed Race (Non-Hispanic) 14705.3 5493.4 2.677 0.007455
## raceNon-Black / Non-Hispanic 13964.7 1352.9 10.322 < 2e-16
## total.incarcerations -6241.8 676.1 -9.232 < 2e-16
## marital.statusMarried 7935.8 1321.1 6.007 2.03e-09
## marital.statusmissing -2669.6 5053.8 -0.528 0.597354
## marital.statusNever married -956.0 1357.5 -0.704 0.481306
## marital.statusSeparated -739.3 2899.1 -0.255 0.798729
## marital.statusWidowed -3085.5 6636.1 -0.465 0.641988
## sexFemale:raceHispanic -8505.0 2274.8 -3.739 0.000187
## sexFemale:raceMixed Race (Non-Hispanic) -8494.0 8004.0 -1.061 0.288640
## sexFemale:raceNon-Black / Non-Hispanic -7764.4 1869.3 -4.154 3.33e-05
##
## (Intercept) ***
## sexFemale ***
## raceHispanic ***
## raceMixed Race (Non-Hispanic) **
## raceNon-Black / Non-Hispanic ***
## total.incarcerations ***
## marital.statusMarried ***
## marital.statusmissing
## marital.statusNever married
## marital.statusSeparated
## marital.statusWidowed
## sexFemale:raceHispanic ***
## sexFemale:raceMixed Race (Non-Hispanic)
## sexFemale:raceNon-Black / Non-Hispanic ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 26870 on 4919 degrees of freedom
## Multiple R-squared: 0.1226, Adjusted R-squared: 0.1203
## F-statistic: 52.87 on 13 and 4919 DF, p-value: < 2.2e-16
In this regression model we added the variable marital.status. An anova test was carried out to check its statistical significance. The p-value for the anova test comparing the models with and without marital.status came out to be < 2.2e-16, suggesting statistical significance. The adjusted R-squared of 0.1203 is an increase over the 0.0993 of the model without the marital.status variable. This shows that more of the variability of income can now be explained by the model.
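The F statistic in the anova table for the marital.status step can be reproduced by hand from the quantities printed in that table. The sketch below is only an illustration of the arithmetic; the variable names are ours, and the inputs are the rounded values from the output, so the result agrees with the table only up to rounding.

```r
# Values copied from the anova table for the marital.status step.
sum_sq_added <- 8.8208e10   # "Sum of Sq" for Model 2
df_added     <- 5           # extra coefficients in Model 2
rss_large    <- 3.5511e12   # RSS of Model 2
res_df_large <- 4919        # residual degrees of freedom of Model 2

# Partial F statistic: mean square gained per added parameter,
# divided by the residual mean square of the larger model.
f_stat <- (sum_sq_added / df_added) / (rss_large / res_df_large)

# Upper-tail probability of the F distribution.
p_val <- pf(f_stat, df_added, res_df_large, lower.tail = FALSE)

round(f_stat, 2)  # close to the F of 24.437 in the table
p_val
```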
lm.add.hard.times <- lm(income ~ sex + race + sex*race + total.incarcerations + marital.status + hard.times, data = renamed_nlsy)
anova(lm.add.marital.status, lm.add.hard.times)
## Analysis of Variance Table
##
## Model 1: income ~ sex + race + sex * race + total.incarcerations + marital.status
## Model 2: income ~ sex + race + sex * race + total.incarcerations + marital.status +
## hard.times
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 4919 3.5511e+12
## 2 4917 3.5402e+12 2 1.0922e+10 7.5852 0.0005139 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Call:
## lm(formula = income ~ sex + race + sex * race + total.incarcerations +
## marital.status + hard.times, data = renamed_nlsy)
##
## Residuals:
## Min 1Q Median 3Q Max
## -61274 -18914 -4334 13526 107971
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 39624.9 1981.4 19.998 < 2e-16
## sexFemale -6885.3 1550.9 -4.440 9.21e-06
## raceHispanic 9037.6 1626.7 5.556 2.91e-08
## raceMixed Race (Non-Hispanic) 14484.9 5487.9 2.639 0.008331
## raceNon-Black / Non-Hispanic 13652.3 1353.5 10.087 < 2e-16
## total.incarcerations -6081.9 676.6 -8.989 < 2e-16
## marital.statusMarried 7972.4 1319.4 6.043 1.63e-09
## marital.statusmissing -2547.0 5047.2 -0.505 0.613834
## marital.statusNever married -934.4 1355.7 -0.689 0.490708
## marital.statusSeparated -607.7 2896.0 -0.210 0.833802
## marital.statusWidowed -3056.5 6629.2 -0.461 0.644766
## hard.timesno 224.0 1224.6 0.183 0.854899
## hard.timesyes -6915.6 2119.2 -3.263 0.001109
## sexFemale:raceHispanic -8274.2 2272.6 -3.641 0.000274
## sexFemale:raceMixed Race (Non-Hispanic) -8219.2 7993.8 -1.028 0.303908
## sexFemale:raceNon-Black / Non-Hispanic -7490.5 1868.5 -4.009 6.19e-05
##
## (Intercept) ***
## sexFemale ***
## raceHispanic ***
## raceMixed Race (Non-Hispanic) **
## raceNon-Black / Non-Hispanic ***
## total.incarcerations ***
## marital.statusMarried ***
## marital.statusmissing
## marital.statusNever married
## marital.statusSeparated
## marital.statusWidowed
## hard.timesno
## hard.timesyes **
## sexFemale:raceHispanic ***
## sexFemale:raceMixed Race (Non-Hispanic)
## sexFemale:raceNon-Black / Non-Hispanic ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 26830 on 4917 degrees of freedom
## Multiple R-squared: 0.1253, Adjusted R-squared: 0.1226
## F-statistic: 46.95 on 15 and 4917 DF, p-value: < 2.2e-16
In this step we added the hard.times variable to the regression model. To check the statistical significance of adding this variable, an anova test was conducted. The p-value of 0.0005139 suggests statistical significance. The adjusted R-squared value came out to be 0.1226. This is again an increase over the adjusted R-squared of 0.1203 from the model without hard.times, suggesting that slightly more of the variability of income is now being accounted for.
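Because each step is judged partly by its adjusted R-squared, it may help to see how that figure follows from the multiple R-squared and the degrees of freedom in the summary output. The sketch below uses the rounded values printed for the hard.times model; the variable names are ours, and the result should land close to the reported 0.1226.

```r
# Values copied from the summary output of the hard.times model.
r_squared   <- 0.1253  # Multiple R-squared
n_predictor <- 15      # numerator df of the overall F-statistic
res_df      <- 4917    # residual degrees of freedom

# Number of observations used in the fit.
n_obs <- res_df + n_predictor + 1

# Adjusted R-squared penalises R-squared for the number of predictors.
adj_r_squared <- 1 - (1 - r_squared) * (n_obs - 1) / res_df

n_obs
round(adj_r_squared, 4)
```

Unlike the multiple R-squared, this quantity can fall when a variable adds parameters without reducing the residual sum of squares enough, which is why it is the figure tracked from step to step.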
lm.add.work.lim <- lm(income ~ sex + race + sex*race + total.incarcerations + marital.status + hard.times + work.limitation, data = renamed_nlsy)
anova(lm.add.hard.times, lm.add.work.lim)
## Analysis of Variance Table
##
## Model 1: income ~ sex + race + sex * race + total.incarcerations + marital.status +
## hard.times
## Model 2: income ~ sex + race + sex * race + total.incarcerations + marital.status +
## hard.times + work.limitation
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 4917 3.5402e+12
## 2 4915 3.5165e+12 2 2.3677e+10 16.547 6.884e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Call:
## lm(formula = income ~ sex + race + sex * race + total.incarcerations +
## marital.status + hard.times + work.limitation, data = renamed_nlsy)
##
## Residuals:
## Min 1Q Median 3Q Max
## -61895 -18447 -4366 13487 107580
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 39702.51 1978.17 20.070 < 2e-16
## sexFemale -6865.25 1546.20 -4.440 9.19e-06
## raceHispanic 9164.09 1622.14 5.649 1.70e-08
## raceMixed Race (Non-Hispanic) 14227.55 5470.85 2.601 0.009334
## raceNon-Black / Non-Hispanic 14067.14 1351.17 10.411 < 2e-16
## total.incarcerations -5951.61 675.00 -8.817 < 2e-16
## marital.statusMarried 7596.14 1316.90 5.768 8.50e-09
## marital.statusmissing -2975.30 5031.87 -0.591 0.554353
## marital.statusNever married -1146.50 1351.96 -0.848 0.396466
## marital.statusSeparated -768.66 2887.64 -0.266 0.790105
## marital.statusWidowed -3157.48 6608.33 -0.478 0.632812
## hard.timesno 713.49 8101.03 0.088 0.929821
## hard.timesyes -6059.94 8217.16 -0.737 0.460870
## work.limitationno 16.07 8118.59 0.002 0.998421
## work.limitationyes -10117.61 8267.84 -1.224 0.221113
## sexFemale:raceHispanic -8441.76 2265.63 -3.726 0.000197
## sexFemale:raceMixed Race (Non-Hispanic) -7738.79 7969.10 -0.971 0.331546
## sexFemale:raceNon-Black / Non-Hispanic -7586.76 1862.88 -4.073 4.72e-05
##
## (Intercept) ***
## sexFemale ***
## raceHispanic ***
## raceMixed Race (Non-Hispanic) **
## raceNon-Black / Non-Hispanic ***
## total.incarcerations ***
## marital.statusMarried ***
## marital.statusmissing
## marital.statusNever married
## marital.statusSeparated
## marital.statusWidowed
## hard.timesno
## hard.timesyes
## work.limitationno
## work.limitationyes
## sexFemale:raceHispanic ***
## sexFemale:raceMixed Race (Non-Hispanic)
## sexFemale:raceNon-Black / Non-Hispanic ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 26750 on 4915 degrees of freedom
## Multiple R-squared: 0.1311, Adjusted R-squared: 0.1281
## F-statistic: 43.64 on 17 and 4915 DF, p-value: < 2.2e-16
In this step we added another variable, work.limitation, to the regression model. The p-value for the anova comparing the models with and without work.limitation is 6.884e-08, suggesting statistical significance. The adjusted R-squared value came out to be 0.1281, an increase over the 0.1226 of the model without work.limitation. This again suggests that progressively more of the variability of income is being explained by the model as more relevant variables are added.
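Since the question is whether the income gap between men and women varies with other factors, the sexFemale coefficient and the sex-by-race interaction terms from the summary above can be combined into an estimated female-minus-male gap for each race group. This is a rough sketch using the printed estimates of the work.limitation model; it assumes the reference race level is the one category that does not appear in the coefficient table, and it ignores the uncertainty of the estimates.

```r
# Coefficients copied from the summary of the work.limitation model.
sex_female <- -6865.25
interaction <- c(
  "Hispanic"                  = -8441.76,
  "Mixed Race (Non-Hispanic)" = -7738.79,
  "Non-Black / Non-Hispanic"  = -7586.76
)

# Assumption: the reference race level is the one absent from the output.
# Within that level the gap is the sexFemale coefficient alone; in every
# other level the matching interaction term is added to it.
gap_by_race <- c("Reference race level" = sex_female,
                 sex_female + interaction)
round(gap_by_race, 2)
```

The Mixed Race (Non-Hispanic) interaction is not statistically significant in the output (p = 0.33), so the gap computed for that group should be read with particular caution.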
lm.add.highest.degree <- lm(income ~ sex + race + sex*race + total.incarcerations + marital.status + marital.status + work.limitation + highest.degree, data = renamed_nlsy)
anova(lm.add.work.lim, lm.add.highest.degree)
## Analysis of Variance Table
##
## Model 1: income ~ sex + race + sex * race + total.incarcerations + marital.status +
## hard.times + work.limitation
## Model 2: income ~ sex + race + sex * race + total.incarcerations + marital.status +
## marital.status + work.limitation + highest.degree
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 4915 3.5165e+12
## 2 4909 3.0241e+12 6 4.9239e+11 133.22 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Call:
## lm(formula = income ~ sex + race + sex * race + total.incarcerations +
## marital.status + marital.status + work.limitation + highest.degree,
## data = renamed_nlsy)
##
## Residuals:
## Min 1Q Median 3Q Max
## -79974 -16861 -3488 12887 110464
##
## Coefficients:
## Estimate Std. Error t value
## (Intercept) 49536.6 2248.9 22.027
## sexFemale -8902.7 1438.3 -6.190
## raceHispanic 9324.9 1507.6 6.185
## raceMixed Race (Non-Hispanic) 5054.3 5089.7 0.993
## raceNon-Black / Non-Hispanic 11067.3 1258.4 8.795
## total.incarcerations -3566.6 639.4 -5.578
## marital.statusMarried 3167.8 1233.1 2.569
## marital.statusmissing -3860.1 4671.8 -0.826
## marital.statusNever married -3596.5 1258.8 -2.857
## marital.statusSeparated -636.3 2682.6 -0.237
## marital.statusWidowed -2121.4 6138.8 -0.346
## work.limitationno -1610.9 1139.3 -1.414
## work.limitationyes -8622.1 1919.3 -4.492
## highest.degreeBachelor's degree (BA, BS) 9854.1 1496.8 6.584
## highest.degreeGED -13794.9 1714.0 -8.048
## highest.degreeHigh school diploma 12 year -8283.1 1400.4 -5.915
## highest.degreeMaster's degree (MA, MS) 18525.8 2002.8 9.250
## highest.degreemissing -2452.1 1935.4 -1.267
## highest.degreeNone -18919.7 1966.3 -9.622
## highest.degreePhD 37889.6 7290.4 5.197
## highest.degreeProfessional degree (DDS, JD, MD) 38889.8 4871.5 7.983
## sexFemale:raceHispanic -7534.0 2103.0 -3.583
## sexFemale:raceMixed Race (Non-Hispanic) 945.5 7405.3 0.128
## sexFemale:raceNon-Black / Non-Hispanic -8674.3 1730.3 -5.013
## Pr(>|t|)
## (Intercept) < 2e-16 ***
## sexFemale 6.52e-10 ***
## raceHispanic 6.71e-10 ***
## raceMixed Race (Non-Hispanic) 0.320742
## raceNon-Black / Non-Hispanic < 2e-16 ***
## total.incarcerations 2.56e-08 ***
## marital.statusMarried 0.010226 *
## marital.statusmissing 0.408690
## marital.statusNever married 0.004293 **
## marital.statusSeparated 0.812530
## marital.statusWidowed 0.729677
## work.limitationno 0.157465
## work.limitationyes 7.21e-06 ***
## highest.degreeBachelor's degree (BA, BS) 5.07e-11 ***
## highest.degreeGED 1.04e-15 ***
## highest.degreeHigh school diploma 12 year 3.55e-09 ***
## highest.degreeMaster's degree (MA, MS) < 2e-16 ***
## highest.degreemissing 0.205217
## highest.degreeNone < 2e-16 ***
## highest.degreePhD 2.11e-07 ***
## highest.degreeProfessional degree (DDS, JD, MD) 1.76e-15 ***
## sexFemale:raceHispanic 0.000344 ***
## sexFemale:raceMixed Race (Non-Hispanic) 0.898412
## sexFemale:raceNon-Black / Non-Hispanic 5.54e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 24820 on 4909 degrees of freedom
## Multiple R-squared: 0.2528, Adjusted R-squared: 0.2493
## F-statistic: 72.21 on 23 and 4909 DF, p-value: < 2.2e-16
Next we added the variable highest.degree to our model, because intuitively educational attainment seems likely to affect income. To find out if it was statistically significant, we ran an anova test comparing the models with and without this variable. The p-value for this test came out to be < 2.2e-16, suggesting very high statistical significance. The adjusted R-squared of 0.2493 is a substantial increase over the 0.1281 of the previous model, which shows that considerably more of the variability of income is explained once highest.degree is included. Note that the fitted formula lists marital.status twice and omits hard.times, so the two models compared here are not strictly nested.
lm.add.emp.ind <- lm(income ~ sex + race + sex*race + total.incarcerations + marital.status + marital.status + work.limitation + highest.degree + emp_industry_2011, data = renamed_nlsy)
anova(lm.add.highest.degree, lm.add.emp.ind)
## Analysis of Variance Table
##
## Model 1: income ~ sex + race + sex * race + total.incarcerations + marital.status +
## marital.status + work.limitation + highest.degree
## Model 2: income ~ sex + race + sex * race + total.incarcerations + marital.status +
## marital.status + work.limitation + highest.degree + emp_industry_2011
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 4909 3.0241e+12
## 2 4892 2.9076e+12 17 1.1651e+11 11.531 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Call:
## lm(formula = income ~ sex + race + sex * race + total.incarcerations +
## marital.status + marital.status + work.limitation + highest.degree +
## emp_industry_2011, data = renamed_nlsy)
##
## Residuals:
## Min 1Q Median 3Q Max
## -73833 -16149 -3133 12629 106146
##
## Coefficients:
## Estimate
## (Intercept) 46157.85
## sexFemale -6806.61
## raceHispanic 8988.70
## raceMixed Race (Non-Hispanic) 6649.36
## raceNon-Black / Non-Hispanic 10385.39
## total.incarcerations -3279.25
## marital.statusMarried 3090.45
## marital.statusmissing -1495.51
## marital.statusNever married -3216.59
## marital.statusSeparated -39.13
## marital.statusWidowed -1817.19
## work.limitationno -1274.25
## work.limitationyes -8292.08
## highest.degreeBachelor's degree (BA, BS) 9392.17
## highest.degreeGED -13427.66
## highest.degreeHigh school diploma 12 year -8142.26
## highest.degreeMaster's degree (MA, MS) 18352.54
## highest.degreemissing 2559.04
## highest.degreeNone -17978.55
## highest.degreePhD 39730.44
## highest.degreeProfessional degree (DDS, JD, MD) 37282.71
## emp_industry_2011ACTIVE DUTY MILITARY 14374.04
## emp_industry_2011AGRICULTURE, FORESTRY AND FISHERIES 3377.00
## emp_industry_2011CONSTRUCTION 7195.99
## emp_industry_2011EDUCATIONAL, HEALTH, AND SOCIAL SERVICES -1006.62
## emp_industry_2011ENTERTAINMENT, ACCOMODATIONS, AND FOOD SERVICES -3396.45
## emp_industry_2011FINANCE, INSURANCE, AND REAL ESTATE 7339.02
## emp_industry_2011INFORMATION AND COMMUNICATION 6301.95
## emp_industry_2011MANUFACTURING 7508.35
## emp_industry_2011MINING 25475.80
## emp_industry_2011missing -3351.72
## emp_industry_2011OTHER SERVICES -3481.04
## emp_industry_2011PROFESSIONAL AND RELATED SERVICES 5175.87
## emp_industry_2011PUBLIC ADMINISTRATION 10326.64
## emp_industry_2011RETAIL TRADE -868.36
## emp_industry_2011TRANSPORTATION AND WAREHOUSING 7317.14
## emp_industry_2011UTILITIES 21433.31
## emp_industry_2011WHOLESALE TRADE 5910.18
## sexFemale:raceHispanic -7432.29
## sexFemale:raceMixed Race (Non-Hispanic) 691.75
## sexFemale:raceNon-Black / Non-Hispanic -8106.49
## Std. Error
## (Intercept) 8900.58
## sexFemale 1437.85
## raceHispanic 1483.80
## raceMixed Race (Non-Hispanic) 5008.97
## raceNon-Black / Non-Hispanic 1244.27
## total.incarcerations 632.77
## marital.statusMarried 1215.52
## marital.statusmissing 4602.01
## marital.statusNever married 1240.79
## marital.statusSeparated 2636.98
## marital.statusWidowed 6033.17
## work.limitationno 1120.76
## work.limitationyes 1887.05
## highest.degreeBachelor's degree (BA, BS) 1476.15
## highest.degreeGED 1693.43
## highest.degreeHigh school diploma 12 year 1383.08
## highest.degreeMaster's degree (MA, MS) 1984.83
## highest.degreemissing 2112.57
## highest.degreeNone 1949.51
## highest.degreePhD 7170.89
## highest.degreeProfessional degree (DDS, JD, MD) 4802.70
## emp_industry_2011ACTIVE DUTY MILITARY 10386.07
## emp_industry_2011AGRICULTURE, FORESTRY AND FISHERIES 9656.47
## emp_industry_2011CONSTRUCTION 8782.80
## emp_industry_2011EDUCATIONAL, HEALTH, AND SOCIAL SERVICES 8670.15
## emp_industry_2011ENTERTAINMENT, ACCOMODATIONS, AND FOOD SERVICES 8708.19
## emp_industry_2011FINANCE, INSURANCE, AND REAL ESTATE 8747.52
## emp_industry_2011INFORMATION AND COMMUNICATION 8939.55
## emp_industry_2011MANUFACTURING 8745.28
## emp_industry_2011MINING 9834.23
## emp_industry_2011missing 8702.65
## emp_industry_2011OTHER SERVICES 8797.64
## emp_industry_2011PROFESSIONAL AND RELATED SERVICES 8699.90
## emp_industry_2011PUBLIC ADMINISTRATION 8809.51
## emp_industry_2011RETAIL TRADE 8708.54
## emp_industry_2011TRANSPORTATION AND WAREHOUSING 8887.03
## emp_industry_2011UTILITIES 10021.98
## emp_industry_2011WHOLESALE TRADE 8930.62
## sexFemale:raceHispanic 2069.54
## sexFemale:raceMixed Race (Non-Hispanic) 7279.02
## sexFemale:raceNon-Black / Non-Hispanic 1706.04
## t value
## (Intercept) 5.186
## sexFemale -4.734
## raceHispanic 6.058
## raceMixed Race (Non-Hispanic) 1.327
## raceNon-Black / Non-Hispanic 8.347
## total.incarcerations -5.182
## marital.statusMarried 2.542
## marital.statusmissing -0.325
## marital.statusNever married -2.592
## marital.statusSeparated -0.015
## marital.statusWidowed -0.301
## work.limitationno -1.137
## work.limitationyes -4.394
## highest.degreeBachelor's degree (BA, BS) 6.363
## highest.degreeGED -7.929
## highest.degreeHigh school diploma 12 year -5.887
## highest.degreeMaster's degree (MA, MS) 9.246
## highest.degreemissing 1.211
## highest.degreeNone -9.222
## highest.degreePhD 5.541
## highest.degreeProfessional degree (DDS, JD, MD) 7.763
## emp_industry_2011ACTIVE DUTY MILITARY 1.384
## emp_industry_2011AGRICULTURE, FORESTRY AND FISHERIES 0.350
## emp_industry_2011CONSTRUCTION 0.819
## emp_industry_2011EDUCATIONAL, HEALTH, AND SOCIAL SERVICES -0.116
## emp_industry_2011ENTERTAINMENT, ACCOMODATIONS, AND FOOD SERVICES -0.390
## emp_industry_2011FINANCE, INSURANCE, AND REAL ESTATE 0.839
## emp_industry_2011INFORMATION AND COMMUNICATION 0.705
## emp_industry_2011MANUFACTURING 0.859
## emp_industry_2011MINING 2.591
## emp_industry_2011missing -0.385
## emp_industry_2011OTHER SERVICES -0.396
## emp_industry_2011PROFESSIONAL AND RELATED SERVICES 0.595
## emp_industry_2011PUBLIC ADMINISTRATION 1.172
## emp_industry_2011RETAIL TRADE -0.100
## emp_industry_2011TRANSPORTATION AND WAREHOUSING 0.823
## emp_industry_2011UTILITIES 2.139
## emp_industry_2011WHOLESALE TRADE 0.662
## sexFemale:raceHispanic -3.591
## sexFemale:raceMixed Race (Non-Hispanic) 0.095
## sexFemale:raceNon-Black / Non-Hispanic -4.752
## Pr(>|t|)
## (Intercept) 2.24e-07 ***
## sexFemale 2.26e-06 ***
## raceHispanic 1.48e-09 ***
## raceMixed Race (Non-Hispanic) 0.184409
## raceNon-Black / Non-Hispanic < 2e-16 ***
## total.incarcerations 2.28e-07 ***
## marital.statusMarried 0.011037 *
## marital.statusmissing 0.745219
## marital.statusNever married 0.009560 **
## marital.statusSeparated 0.988161
## marital.statusWidowed 0.763275
## work.limitationno 0.255614
## work.limitationyes 1.14e-05 ***
## highest.degreeBachelor's degree (BA, BS) 2.16e-10 ***
## highest.degreeGED 2.71e-15 ***
## highest.degreeHigh school diploma 12 year 4.19e-09 ***
## highest.degreeMaster's degree (MA, MS) < 2e-16 ***
## highest.degreemissing 0.225823
## highest.degreeNone < 2e-16 ***
## highest.degreePhD 3.17e-08 ***
## highest.degreeProfessional degree (DDS, JD, MD) 1.00e-14 ***
## emp_industry_2011ACTIVE DUTY MILITARY 0.166430
## emp_industry_2011AGRICULTURE, FORESTRY AND FISHERIES 0.726569
## emp_industry_2011CONSTRUCTION 0.412639
## emp_industry_2011EDUCATIONAL, HEALTH, AND SOCIAL SERVICES 0.907576
## emp_industry_2011ENTERTAINMENT, ACCOMODATIONS, AND FOOD SERVICES 0.696532
## emp_industry_2011FINANCE, INSURANCE, AND REAL ESTATE 0.401520
## emp_industry_2011INFORMATION AND COMMUNICATION 0.480874
## emp_industry_2011MANUFACTURING 0.390625
## emp_industry_2011MINING 0.009611 **
## emp_industry_2011missing 0.700152
## emp_industry_2011OTHER SERVICES 0.692359
## emp_industry_2011PROFESSIONAL AND RELATED SERVICES 0.551915
## emp_industry_2011PUBLIC ADMINISTRATION 0.241168
## emp_industry_2011RETAIL TRADE 0.920576
## emp_industry_2011TRANSPORTATION AND WAREHOUSING 0.410349
## emp_industry_2011UTILITIES 0.032515 *
## emp_industry_2011WHOLESALE TRADE 0.508138
## sexFemale:raceHispanic 0.000332 ***
## sexFemale:raceMixed Race (Non-Hispanic) 0.924292
## sexFemale:raceNon-Black / Non-Hispanic 2.08e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 24380 on 4892 degrees of freedom
## Multiple R-squared: 0.2816, Adjusted R-squared: 0.2757
## F-statistic: 47.94 on 40 and 4892 DF, p-value: < 2.2e-16
Next we decided to add the variable emp_industry_2011 to our model. To find out if it was statistically significant, we ran an anova test with and without this variable. The p-value for this test came out to be < 2.2e-16, suggesting very high statistical significance. The adjusted R-squared of 0.2757 is an increase over the 0.2493 of the model without emp_industry_2011, which shows that more of the variability of income can be explained using this variable. This makes sense, because better employment opportunities lead to a higher income.
lm.add.citizenship <- lm(income ~ sex + race + sex*race + total.incarcerations + marital.status + marital.status + work.limitation + highest.degree + emp_industry_2011 + citizenship, data = renamed_nlsy)
anova(lm.add.emp.ind, lm.add.citizenship)
## Analysis of Variance Table
##
## Model 1: income ~ sex + race + sex * race + total.incarcerations + marital.status +
## marital.status + work.limitation + highest.degree + emp_industry_2011
## Model 2: income ~ sex + race + sex * race + total.incarcerations + marital.status +
## marital.status + work.limitation + highest.degree + emp_industry_2011 +
## citizenship
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 4892 2.9076e+12
## 2 4889 2.8959e+12 3 1.1703e+10 6.5857 0.0001941 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Call:
## lm(formula = income ~ sex + race + sex * race + total.incarcerations +
## marital.status + marital.status + work.limitation + highest.degree +
## emp_industry_2011 + citizenship, data = renamed_nlsy)
##
## Residuals:
## Min 1Q Median 3Q Max
## -73713 -16025 -3103 12843 106607
##
## Coefficients:
## Estimate
## (Intercept) 47391.08
## sexFemale -6727.94
## raceHispanic 7154.85
## raceMixed Race (Non-Hispanic) 6449.02
## raceNon-Black / Non-Hispanic 10422.50
## total.incarcerations -3212.12
## marital.statusMarried 3252.41
## marital.statusmissing -1817.89
## marital.statusNever married -3156.73
## marital.statusSeparated -94.99
## marital.statusWidowed -1758.43
## work.limitationno 5591.76
## work.limitationyes -1340.46
## highest.degreeBachelor's degree (BA, BS) 9436.58
## highest.degreeGED -13179.92
## highest.degreeHigh school diploma 12 year -7967.04
## highest.degreeMaster's degree (MA, MS) 18418.95
## highest.degreemissing 2521.14
## highest.degreeNone -18042.32
## highest.degreePhD 39296.52
## highest.degreeProfessional degree (DDS, JD, MD) 37439.44
## emp_industry_2011ACTIVE DUTY MILITARY 13871.21
## emp_industry_2011AGRICULTURE, FORESTRY AND FISHERIES 2901.86
## emp_industry_2011CONSTRUCTION 6759.10
## emp_industry_2011EDUCATIONAL, HEALTH, AND SOCIAL SERVICES -1671.89
## emp_industry_2011ENTERTAINMENT, ACCOMODATIONS, AND FOOD SERVICES -3994.65
## emp_industry_2011FINANCE, INSURANCE, AND REAL ESTATE 6730.40
## emp_industry_2011INFORMATION AND COMMUNICATION 5665.48
## emp_industry_2011MANUFACTURING 6906.92
## emp_industry_2011MINING 25441.24
## emp_industry_2011missing -3887.45
## emp_industry_2011OTHER SERVICES -4234.86
## emp_industry_2011PROFESSIONAL AND RELATED SERVICES 4492.57
## emp_industry_2011PUBLIC ADMINISTRATION 9817.80
## emp_industry_2011RETAIL TRADE -1428.69
## emp_industry_2011TRANSPORTATION AND WAREHOUSING 6797.22
## emp_industry_2011UTILITIES 20796.20
## emp_industry_2011WHOLESALE TRADE 5338.84
## citizenshipunknown.birthplace -2237.33
## citizenshipUnknown.not.us.born -4336.43
## citizenshipUS.born.citizen -8033.69
## sexFemale:raceHispanic -7575.78
## sexFemale:raceMixed Race (Non-Hispanic) 589.36
## sexFemale:raceNon-Black / Non-Hispanic -8128.30
## Std. Error
## (Intercept) 8890.11
## sexFemale 1435.56
## raceHispanic 1551.43
## raceMixed Race (Non-Hispanic) 5003.73
## raceNon-Black / Non-Hispanic 1242.39
## total.incarcerations 631.95
## marital.statusMarried 1214.53
## marital.statusmissing 4594.98
## marital.statusNever married 1238.87
## marital.statusSeparated 2632.72
## marital.statusWidowed 6024.00
## work.limitationno 4898.87
## work.limitationyes 5123.18
## highest.degreeBachelor's degree (BA, BS) 1474.27
## highest.degreeGED 1691.50
## highest.degreeHigh school diploma 12 year 1381.68
## highest.degreeMaster's degree (MA, MS) 1982.47
## highest.degreemissing 2109.94
## highest.degreeNone 1947.26
## highest.degreePhD 7161.69
## highest.degreeProfessional degree (DDS, JD, MD) 4794.69
## emp_industry_2011ACTIVE DUTY MILITARY 10369.15
## emp_industry_2011AGRICULTURE, FORESTRY AND FISHERIES 9640.91
## emp_industry_2011CONSTRUCTION 8769.19
## emp_industry_2011EDUCATIONAL, HEALTH, AND SOCIAL SERVICES 8657.39
## emp_industry_2011ENTERTAINMENT, ACCOMODATIONS, AND FOOD SERVICES 8694.93
## emp_industry_2011FINANCE, INSURANCE, AND REAL ESTATE 8734.48
## emp_industry_2011INFORMATION AND COMMUNICATION 8925.98
## emp_industry_2011MANUFACTURING 8731.63
## emp_industry_2011MINING 9817.50
## emp_industry_2011missing 8689.41
## emp_industry_2011OTHER SERVICES 8784.72
## emp_industry_2011PROFESSIONAL AND RELATED SERVICES 8687.17
## emp_industry_2011PUBLIC ADMINISTRATION 8795.53
## emp_industry_2011RETAIL TRADE 8695.51
## emp_industry_2011TRANSPORTATION AND WAREHOUSING 8872.84
## emp_industry_2011UTILITIES 10009.73
## emp_industry_2011WHOLESALE TRADE 8916.91
## citizenshipunknown.birthplace 5137.17
## citizenshipUnknown.not.us.born 5484.89
## citizenshipUS.born.citizen 5002.79
## sexFemale:raceHispanic 2067.36
## sexFemale:raceMixed Race (Non-Hispanic) 7268.97
## sexFemale:raceNon-Black / Non-Hispanic 1703.33
## t value
## (Intercept) 5.331
## sexFemale -4.687
## raceHispanic 4.612
## raceMixed Race (Non-Hispanic) 1.289
## raceNon-Black / Non-Hispanic 8.389
## total.incarcerations -5.083
## marital.statusMarried 2.678
## marital.statusmissing -0.396
## marital.statusNever married -2.548
## marital.statusSeparated -0.036
## marital.statusWidowed -0.292
## work.limitationno 1.141
## work.limitationyes -0.262
## highest.degreeBachelor's degree (BA, BS) 6.401
## highest.degreeGED -7.792
## highest.degreeHigh school diploma 12 year -5.766
## highest.degreeMaster's degree (MA, MS) 9.291
## highest.degreemissing 1.195
## highest.degreeNone -9.265
## highest.degreePhD 5.487
## highest.degreeProfessional degree (DDS, JD, MD) 7.809
## emp_industry_2011ACTIVE DUTY MILITARY 1.338
## emp_industry_2011AGRICULTURE, FORESTRY AND FISHERIES 0.301
## emp_industry_2011CONSTRUCTION 0.771
## emp_industry_2011EDUCATIONAL, HEALTH, AND SOCIAL SERVICES -0.193
## emp_industry_2011ENTERTAINMENT, ACCOMODATIONS, AND FOOD SERVICES -0.459
## emp_industry_2011FINANCE, INSURANCE, AND REAL ESTATE 0.771
## emp_industry_2011INFORMATION AND COMMUNICATION 0.635
## emp_industry_2011MANUFACTURING 0.791
## emp_industry_2011MINING 2.591
## emp_industry_2011missing -0.447
## emp_industry_2011OTHER SERVICES -0.482
## emp_industry_2011PROFESSIONAL AND RELATED SERVICES 0.517
## emp_industry_2011PUBLIC ADMINISTRATION 1.116
## emp_industry_2011RETAIL TRADE -0.164
## emp_industry_2011TRANSPORTATION AND WAREHOUSING 0.766
## emp_industry_2011UTILITIES 2.078
## emp_industry_2011WHOLESALE TRADE 0.599
## citizenshipunknown.birthplace -0.436
## citizenshipUnknown.not.us.born -0.791
## citizenshipUS.born.citizen -1.606
## sexFemale:raceHispanic -3.664
## sexFemale:raceMixed Race (Non-Hispanic) 0.081
## sexFemale:raceNon-Black / Non-Hispanic -4.772
## Pr(>|t|)
## (Intercept) 1.02e-07 ***
## sexFemale 2.85e-06 ***
## raceHispanic 4.09e-06 ***
## raceMixed Race (Non-Hispanic) 0.19751
## raceNon-Black / Non-Hispanic < 2e-16 ***
## total.incarcerations 3.86e-07 ***
## marital.statusMarried 0.00743 **
## marital.statusmissing 0.69240
## marital.statusNever married 0.01086 *
## marital.statusSeparated 0.97122
## marital.statusWidowed 0.77037
## work.limitationno 0.25374
## work.limitationyes 0.79361
## highest.degreeBachelor's degree (BA, BS) 1.69e-10 ***
## highest.degreeGED 8.01e-15 ***
## highest.degreeHigh school diploma 12 year 8.61e-09 ***
## highest.degreeMaster's degree (MA, MS) < 2e-16 ***
## highest.degreemissing 0.23219
## highest.degreeNone < 2e-16 ***
## highest.degreePhD 4.29e-08 ***
## highest.degreeProfessional degree (DDS, JD, MD) 7.03e-15 ***
## emp_industry_2011ACTIVE DUTY MILITARY 0.18104
## emp_industry_2011AGRICULTURE, FORESTRY AND FISHERIES 0.76343
## emp_industry_2011CONSTRUCTION 0.44088
## emp_industry_2011EDUCATIONAL, HEALTH, AND SOCIAL SERVICES 0.84687
## emp_industry_2011ENTERTAINMENT, ACCOMODATIONS, AND FOOD SERVICES 0.64595
## emp_industry_2011FINANCE, INSURANCE, AND REAL ESTATE 0.44101
## emp_industry_2011INFORMATION AND COMMUNICATION 0.52564
## emp_industry_2011MANUFACTURING 0.42897
## emp_industry_2011MINING 0.00959 **
## emp_industry_2011missing 0.65462
## emp_industry_2011OTHER SERVICES 0.62978
## emp_industry_2011PROFESSIONAL AND RELATED SERVICES 0.60507
## emp_industry_2011PUBLIC ADMINISTRATION 0.26438
## emp_industry_2011RETAIL TRADE 0.86950
## emp_industry_2011TRANSPORTATION AND WAREHOUSING 0.44367
## emp_industry_2011UTILITIES 0.03780 *
## emp_industry_2011WHOLESALE TRADE 0.54938
## citizenshipunknown.birthplace 0.66320
## citizenshipUnknown.not.us.born 0.42921
## citizenshipUS.born.citizen 0.10837
## sexFemale:raceHispanic 0.00025 ***
## sexFemale:raceMixed Race (Non-Hispanic) 0.93538
## sexFemale:raceNon-Black / Non-Hispanic 1.88e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 24340 on 4889 degrees of freedom
## Multiple R-squared: 0.2845, Adjusted R-squared: 0.2782
## F-statistic: 45.21 on 43 and 4889 DF, p-value: < 2.2e-16
We also decided to check the variable citizenship in case there were any statistically significant differences in income because of it. The p-value of the anova test came out to be 0.0001941, which is statistically significant. The adjusted R-squared value came out to be 0.2782, a small increase over the 0.2757 of the model without citizenship, which means that citizenship explains a little additional variability of income beyond the other variables.
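To put the individual steps side by side, the adjusted R-squared values reported in the summaries above can be collected and differenced. This is a small sketch built only from the printed figures. The labels name the variable added at each step, and the gain attributed to the highest.degree step should be read loosely, because the printed call for that model also dropped hard.times.

```r
# Adjusted R-squared after each step, copied from the summary outputs.
adj_r2 <- c(
  total.incarcerations = 0.09934,
  marital.status       = 0.1203,
  hard.times           = 0.1226,
  work.limitation      = 0.1281,
  highest.degree       = 0.2493,
  emp_industry_2011    = 0.2757,
  citizenship          = 0.2782
)

# Change in adjusted R-squared relative to the preceding model;
# each entry is labelled with the variable added at that step.
gain <- diff(adj_r2)
round(gain, 4)

# Step with the largest gain.
names(which.max(gain))
```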
##
## Call:
## lm(formula = income ~ sex + race + sex * race + total.incarcerations +
## marital.status + marital.status + work.limitation + highest.degree +
## emp_industry_2011 + citizenship, data = renamed_nlsy)
##
## Residuals:
## Min 1Q Median 3Q Max
## -73713 -16025 -3103 12843 106607
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 47391.08 8890.11 5.331 1.02e-07 ***
## sexFemale -6727.94 1435.56 -4.687 2.85e-06 ***
## raceHispanic 7154.85 1551.43 4.612 4.09e-06 ***
## raceMixed Race (Non-Hispanic) 6449.02 5003.73 1.289 0.19751
## raceNon-Black / Non-Hispanic 10422.50 1242.39 8.389 < 2e-16 ***
## total.incarcerations -3212.12 631.95 -5.083 3.86e-07 ***
## marital.statusMarried 3252.41 1214.53 2.678 0.00743 **
## marital.statusmissing -1817.89 4594.98 -0.396 0.69240
## marital.statusNever married -3156.73 1238.87 -2.548 0.01086 *
## marital.statusSeparated -94.99 2632.72 -0.036 0.97122
## marital.statusWidowed -1758.43 6024.00 -0.292 0.77037
## work.limitationno 5591.76 4898.87 1.141 0.25374
## work.limitationyes -1340.46 5123.18 -0.262 0.79361
## highest.degreeBachelor's degree (BA, BS) 9436.58 1474.27 6.401 1.69e-10 ***
## highest.degreeGED -13179.92 1691.50 -7.792 8.01e-15 ***
## highest.degreeHigh school diploma 12 year -7967.04 1381.68 -5.766 8.61e-09 ***
## highest.degreeMaster's degree (MA, MS) 18418.95 1982.47 9.291 < 2e-16 ***
## highest.degreemissing 2521.14 2109.94 1.195 0.23219
## highest.degreeNone -18042.32 1947.26 -9.265 < 2e-16 ***
## highest.degreePhD 39296.52 7161.69 5.487 4.29e-08 ***
## highest.degreeProfessional degree (DDS, JD, MD) 37439.44 4794.69 7.809 7.03e-15 ***
## emp_industry_2011ACTIVE DUTY MILITARY 13871.21 10369.15 1.338 0.18104
## emp_industry_2011AGRICULTURE, FORESTRY AND FISHERIES 2901.86 9640.91 0.301 0.76343
## emp_industry_2011CONSTRUCTION 6759.10 8769.19 0.771 0.44088
## emp_industry_2011EDUCATIONAL, HEALTH, AND SOCIAL SERVICES -1671.89 8657.39 -0.193 0.84687
## emp_industry_2011ENTERTAINMENT, ACCOMODATIONS, AND FOOD SERVICES -3994.65 8694.93 -0.459 0.64595
## emp_industry_2011FINANCE, INSURANCE, AND REAL ESTATE 6730.40 8734.48 0.771 0.44101
## emp_industry_2011INFORMATION AND COMMUNICATION 5665.48 8925.98 0.635 0.52564
## emp_industry_2011MANUFACTURING 6906.92 8731.63 0.791 0.42897
## emp_industry_2011MINING 25441.24 9817.50 2.591 0.00959 **
## emp_industry_2011missing -3887.45 8689.41 -0.447 0.65462
## emp_industry_2011OTHER SERVICES -4234.86 8784.72 -0.482 0.62978
## emp_industry_2011PROFESSIONAL AND RELATED SERVICES 4492.57 8687.17 0.517 0.60507
## emp_industry_2011PUBLIC ADMINISTRATION 9817.80 8795.53 1.116 0.26438
## emp_industry_2011RETAIL TRADE -1428.69 8695.51 -0.164 0.86950
## emp_industry_2011TRANSPORTATION AND WAREHOUSING 6797.22 8872.84 0.766 0.44367
## emp_industry_2011UTILITIES 20796.20 10009.73 2.078 0.03780 *
## emp_industry_2011WHOLESALE TRADE 5338.84 8916.91 0.599 0.54938
## citizenshipunknown.birthplace -2237.33 5137.17 -0.436 0.66320
## citizenshipUnknown.not.us.born -4336.43 5484.89 -0.791 0.42921
## citizenshipUS.born.citizen -8033.69 5002.79 -1.606 0.10837
## sexFemale:raceHispanic -7575.78 2067.36 -3.664 0.00025 ***
## sexFemale:raceMixed Race (Non-Hispanic) 589.36 7268.97 0.081 0.93538
## sexFemale:raceNon-Black / Non-Hispanic -8128.30 1703.33 -4.772 1.88e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 24340 on 4889 degrees of freedom
## Multiple R-squared: 0.2845, Adjusted R-squared: 0.2782
## F-statistic: 45.21 on 43 and 4889 DF, p-value: < 2.2e-16
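Because the adjusted R-squared belongs to the whole model, one way to isolate what a single variable such as citizenship contributes is to fit the model with and without it and compare the two. The following is a minimal sketch on simulated data, not the analysis that produced the output above: the data frame `toy`, its factor levels, and the effect sizes are all made up for illustration.

```r
# Simulated stand-in data (hypothetical; NOT the NLSY97 sample)
set.seed(1)
n <- 500
toy <- data.frame(
  sex         = factor(sample(c("Male", "Female"), n, replace = TRUE),
                       levels = c("Male", "Female")),
  citizenship = factor(sample(c("born.abroad", "US.born"), n, replace = TRUE))
)
toy$income <- 50000 - 12000 * (toy$sex == "Female") +
  3000 * (toy$citizenship == "US.born") + rnorm(n, sd = 20000)

# Reduced model (without citizenship) and full model (with citizenship)
reduced <- lm(income ~ sex, data = toy)
full    <- lm(income ~ sex + citizenship, data = toy)

# Partial F-test: does citizenship improve the fit beyond the other terms?
cmp <- anova(reduced, full)
print(cmp)

# Change in adjusted R-squared when citizenship is added
delta_adj_r2 <- summary(full)$adj.r.squared - summary(reduced)$adj.r.squared
print(delta_adj_r2)
```

The F-test in `cmp` assesses citizenship as a whole (all of its levels jointly), which can be significant even when no single level's coefficient is.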
Checking for collinearity between hard.drugs and hard.times
We wanted to check for collinearity between hard.drugs and hard.times because we felt they capture a similar effect on income: using hard drugs can itself be considered experiencing a hard time, so the two variables may overlap.
The pairs plot for hard.drugs and hard.times shows cause for concern. In the top-right panel, the box for the pair hard.drugs = no, hard.times = no is considerably larger than the others. This box represents the number of people who answered no to both questions. This suggests that either variable would capture most of the effect of the other, so including both in the regression is unnecessary. We take this as evidence of collinearity between hard.drugs and hard.times, which allows us to drop one of the two.
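One way to put a number on the overlap that the plot suggests is to cross-tabulate the two variables and compute a measure of association such as Cramér's V. The sketch below is an illustrative assumption rather than part of the original analysis, and it uses simulated yes/no answers, not the NLSY97 responses.

```r
# Simulated yes/no answers (hypothetical; NOT the NLSY97 sample)
set.seed(2)
n <- 1000
hard.times <- factor(sample(c("no", "yes"), n, replace = TRUE,
                            prob = c(0.9, 0.1)))
# Make drug use somewhat more likely among those reporting hard times
p_yes <- ifelse(hard.times == "yes", 0.30, 0.10)
hard.drugs <- factor(ifelse(runif(n) < p_yes, "yes", "no"),
                     levels = c("no", "yes"))

# Cross-tabulation: the cell counts are the group sizes shown in a pairs plot
tab <- table(hard.drugs, hard.times)
print(tab)

# Chi-squared test of independence (no continuity correction)
chi <- chisq.test(tab, correct = FALSE)

# Cramer's V: 0 means no association, 1 means perfect association
cramers_v <- sqrt(unname(chi$statistic) / (sum(tab) * (min(dim(tab)) - 1)))
print(cramers_v)
```

A value of V close to 1 would support treating the two variables as near-duplicates; a value close to 0 would indicate that they carry largely separate information even if one cell of the table is very large.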
##
## Call:
## lm(formula = income ~ sex + hard.times + hard.drugs, data = renamed_nlsy)
##
## Residuals:
## Min 1Q Median 3Q Max
## -52010 -20017 -4967 14983 104983
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 47099.8 2131.2 22.100 < 2e-16 ***
## sexFemale -11994.9 795.8 -15.073 < 2e-16 ***
## hard.timesno 959.2 1271.9 0.754 0.4508
## hard.timesyes -8851.0 2201.5 -4.020 5.9e-05 ***
## hard.drugsno 3952.6 1899.7 2.081 0.0375 *
## hard.drugsyes 1902.8 2463.8 0.772 0.4400
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 27930 on 4927 degrees of freedom
## Multiple R-squared: 0.05005, Adjusted R-squared: 0.04909
## F-statistic: 51.92 on 5 and 4927 DF, p-value: < 2.2e-16
## Analysis of Variance Table
##
## Model 1: income ~ sex + hard.times
## Model 2: income ~ sex + hard.times + hard.drugs
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 4929 3.8490e+12
## 2 4927 3.8447e+12 2 4329116365 2.7739 0.06252 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## Analysis of Variance Table
##
## Model 1: income ~ sex + hard.drugs
## Model 2: income ~ sex + hard.times + hard.drugs
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 4929 3.8656e+12
## 2 4927 3.8447e+12 2 2.0884e+10 13.382 1.6e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Call:
## lm(formula = income ~ sex + hard.times, data = renamed_nlsy)
##
## Residuals:
## Min 1Q Median 3Q Max
## -51733 -19740 -4740 14265 105260
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 50548 1261 40.072 < 2e-16 ***
## sexFemale -11994 796 -15.068 < 2e-16 ***
## hard.timesno 1187 1268 0.936 0.349
## hard.timesyes -8706 2201 -3.955 7.77e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 27940 on 4929 degrees of freedom
## Multiple R-squared: 0.04898, Adjusted R-squared: 0.0484
## F-statistic: 84.62 on 3 and 4929 DF, p-value: < 2.2e-16
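To decide which of hard.drugs and hard.times to drop, we used ANOVA tests comparing the model containing both variables against the models with one of them removed. Adding hard.times to a model that already contains hard.drugs is highly significant (F = 13.382, p = 1.6e-06), whereas adding hard.drugs to a model that already contains hard.times is not significant at the 0.05 level (F = 2.7739, p = 0.06252). hard.times was therefore kept and hard.drugs was dropped.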
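The two-way comparison described above can also be run in a single call with `drop1()`, which removes each term in turn and F-tests the resulting loss of fit. This is a sketch on simulated data, offered as an alternative to writing out each `anova()` call by hand; the data frame `toy` and its effect sizes are hypothetical.

```r
# Simulated stand-in data (hypothetical; NOT the NLSY97 sample)
set.seed(3)
n <- 800
toy <- data.frame(
  sex        = factor(sample(c("Male", "Female"), n, replace = TRUE),
                      levels = c("Male", "Female")),
  hard.times = factor(sample(c("no", "yes"), n, replace = TRUE,
                             prob = c(0.9, 0.1))),
  hard.drugs = factor(sample(c("no", "yes"), n, replace = TRUE,
                             prob = c(0.85, 0.15)))
)
toy$income <- 50000 - 12000 * (toy$sex == "Female") -
  9000 * (toy$hard.times == "yes") + rnorm(n, sd = 25000)

both <- lm(income ~ sex + hard.times + hard.drugs, data = toy)

# One row per term: the F-test for removing that term from the full model
d1 <- drop1(both, test = "F")
print(d1)

# The same test for hard.drugs, written as an explicit nested comparison
a <- anova(lm(income ~ sex + hard.times, data = toy), both)
print(a)
```

The `hard.drugs` row of `d1` and the second row of `a` report the same test, so either form can be used to decide which variable to keep.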
##
## Call:
## lm(formula = income ~ sex, data = renamed_nlsy)
##
## Residuals:
## Min 1Q Median 3Q Max
## -51135 -20137 -4659 13863 105841
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 51136.7 559.2 91.44 <2e-16 ***
## sexFemale -11977.6 797.9 -15.01 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 28020 on 4931 degrees of freedom
## Multiple R-squared: 0.0437, Adjusted R-squared: 0.04351
## F-statistic: 225.3 on 1 and 4931 DF, p-value: < 2.2e-16
##
## Call:
## lm(formula = income ~ sex + race, data = renamed_nlsy)
##
## Residuals:
## Min 1Q Median 3Q Max
## -55177 -19327 -4233 14673 108767
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 42641.5 891.1 47.852 < 2e-16 ***
## sexFemale -11408.1 784.9 -14.535 < 2e-16 ***
## raceHispanic 6480.3 1160.8 5.583 2.5e-08 ***
## raceMixed Race (Non-Hispanic) 11993.9 4090.4 2.932 0.00338 **
## raceNon-Black / Non-Hispanic 12685.7 953.1 13.310 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 27520 on 4928 degrees of freedom
## Multiple R-squared: 0.07796, Adjusted R-squared: 0.07721
## F-statistic: 104.2 on 4 and 4928 DF, p-value: < 2.2e-16
## Analysis of Variance Table
##
## Model 1: income ~ sex
## Model 2: income ~ sex + race
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 4931 3.8704e+12
## 2 4928 3.7318e+12 3 1.3863e+11 61.025 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Call:
## lm(formula = income ~ sex + race + sex * race, data = renamed_nlsy)
##
## Residuals:
## Min 1Q Median 3Q Max
## -56052 -19202 -4458 13798 106018
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 39458 1152 34.259 < 2e-16 ***
## sexFemale -5475 1572 -3.483 0.000501 ***
## raceHispanic 10719 1659 6.460 1.15e-10 ***
## raceMixed Race (Non-Hispanic) 16782 5614 2.989 0.002810 **
## raceNon-Black / Non-Hispanic 16744 1368 12.242 < 2e-16 ***
## sexFemale:raceHispanic -8085 2320 -3.485 0.000496 ***
## sexFemale:raceMixed Race (Non-Hispanic) -9361 8184 -1.144 0.252759
## sexFemale:raceNon-Black / Non-Hispanic -7791 1905 -4.090 4.38e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 27470 on 4925 degrees of freedom
## Multiple R-squared: 0.0815, Adjusted R-squared: 0.08019
## F-statistic: 62.43 on 7 and 4925 DF, p-value: < 2.2e-16
## Analysis of Variance Table
##
## Model 1: income ~ sex + race
## Model 2: income ~ sex + race + sex * race
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 4928 3.7318e+12
## 2 4925 3.7174e+12 3 1.4341e+10 6.3332 0.000278 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
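With an interaction in the model, the sexFemale coefficient is the female-minus-male gap only for the reference race level (the category omitted from the coefficient table). For every other race, the gap is the sexFemale coefficient plus the matching interaction term. The short sketch below does that arithmetic with the estimates copied from the sex * race summary above; the variable names are ours.

```r
# Estimates copied from the income ~ sex * race summary above
female_main <- -5475
female_by_race <- c(
  "Hispanic"                  = -8085,
  "Mixed Race (Non-Hispanic)" = -9361,
  "Non-Black / Non-Hispanic"  = -7791
)

# Female-minus-male income gap within each race group:
# main effect alone for the reference level, main effect + interaction otherwise
gap <- c("reference level" = female_main, female_main + female_by_race)
print(gap)
```

Since the Mixed Race (Non-Hispanic) interaction is not statistically significant (p = 0.253, with a standard error of 8184), the gap computed for that group is imprecise.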
##
## Call:
## lm(formula = income ~ sex + marital.status, data = renamed_nlsy)
##
## Residuals:
## Min 1Q Median 3Q Max
## -56702 -19491 -4914 14555 107624
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 47478 1300 36.513 < 2e-16 ***
## sexFemale -11988 784 -15.290 < 2e-16 ***
## marital.statusMarried 9423 1346 7.003 2.84e-12 ***
## marital.statusmissing -4827 5162 -0.935 0.350
## marital.statusNever married -2034 1377 -1.477 0.140
## marital.statusSeparated -3115 2957 -1.053 0.292
## marital.statusWidowed -5900 6774 -0.871 0.384
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 27460 on 4926 degrees of freedom
## Multiple R-squared: 0.08219, Adjusted R-squared: 0.08108
## F-statistic: 73.52 on 6 and 4926 DF, p-value: < 2.2e-16
## Analysis of Variance Table
##
## Model 1: income ~ sex
## Model 2: income ~ sex + marital.status
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 4931 3.8704e+12
## 2 4926 3.7146e+12 5 1.5579e+11 41.318 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Call:
## lm(formula = income ~ sex + marital.status + sex * marital.status,
## data = renamed_nlsy)
##
## Residuals:
## Min 1Q Median 3Q Max
## -59503 -19310 -4643 13836 106397
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 46581 1854 25.130 < 2e-16 ***
## sexFemale -10417 2453 -4.247 2.21e-05 ***
## marital.statusMarried 13122 2012 6.522 7.62e-11 ***
## marital.statusmissing -2804 6698 -0.419 0.6755
## marital.statusNever married -3978 2045 -1.945 0.0518 .
## marital.statusSeparated -10490 4152 -2.526 0.0116 *
## marital.statusWidowed 9752 15874 0.614 0.5390
## sexFemale:marital.statusMarried -7299 2696 -2.708 0.0068 **
## sexFemale:marital.statusmissing -4386 10468 -0.419 0.6753
## sexFemale:marital.statusNever married 4456 2757 1.617 0.1060
## sexFemale:marital.statusSeparated 15636 5894 2.653 0.0080 **
## sexFemale:marital.statusWidowed -19488 17544 -1.111 0.2667
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 27310 on 4921 degrees of freedom
## Multiple R-squared: 0.09344, Adjusted R-squared: 0.09141
## F-statistic: 46.11 on 11 and 4921 DF, p-value: < 2.2e-16
## Analysis of Variance Table
##
## Model 1: income ~ sex + marital.status
## Model 2: income ~ sex + marital.status + sex * marital.status
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 4926 3.7146e+12
## 2 4921 3.6691e+12 5 4.5514e+10 12.209 8.765e-12 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## Analysis of Variance Table
##
## Model 1: income ~ sex
## Model 2: income ~ sex + hard.drugs
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 4931 3.8704e+12
## 2 4929 3.8656e+12 2 4.819e+09 3.0724 0.0464 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Call:
## lm(formula = income ~ sex + hard.drugs + sex * hard.drugs, data = renamed_nlsy)
##
## Residuals:
## Min 1Q Median 3Q Max
## -51466 -19468 -4465 14532 105535
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 45744 2567 17.817 <2e-16 ***
## sexFemale -8869 3704 -2.394 0.0167 *
## hard.drugsno 5724 2634 2.173 0.0299 *
## hard.drugsyes 4678 3470 1.348 0.1776
## sexFemale:hard.drugsno -3134 3799 -0.825 0.4095
## sexFemale:hard.drugsyes -5106 4936 -1.034 0.3010
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 28010 on 4927 degrees of freedom
## Multiple R-squared: 0.0451, Adjusted R-squared: 0.04413
## F-statistic: 46.54 on 5 and 4927 DF, p-value: < 2.2e-16
## Analysis of Variance Table
##
## Model 1: income ~ sex + hard.drugs
## Model 2: income ~ sex + hard.drugs + sex * hard.drugs
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 4929 3.8656e+12
## 2 4927 3.8647e+12 2 847072797 0.54 0.5828
## [1] "total.incarcerations" "marijuana" "sex"
## [4] "marital.status" "high.school.diploma" "parenthood.by.20"
## [7] "chance.college.degree" "hard.times" "work.limitation"
## [10] "citizenship" "mother.edu" "father.edu"
## [13] "race" "hard.drugs" "public.private"
## [16] "highest.degree" "emp_industry_2011" "job.type"
## [19] "income"
## Analysis of Variance Table
##
## Model 1: income ~ sex
## Model 2: income ~ sex + hard.times
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 4931 3.8704e+12
## 2 4929 3.8490e+12 2 2.1374e+10 13.686 1.183e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Call:
## lm(formula = income ~ sex + hard.times + sex * hard.times, data = renamed_nlsy)
##
## Residuals:
## Min 1Q Median 3Q Max
## -51883 -19583 -4583 14475 105417
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 50524.6 1719.8 29.378 < 2e-16 ***
## sexFemale -11949.9 2384.9 -5.011 5.62e-07 ***
## hard.timesno 1360.2 1823.5 0.746 0.455753
## hard.timesyes -11294.0 3076.5 -3.671 0.000244 ***
## sexFemale:hard.timesno -352.2 2537.7 -0.139 0.889636
## sexFemale:hard.timesyes 5467.6 4407.4 1.241 0.214828
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 27940 on 4927 degrees of freedom
## Multiple R-squared: 0.04943, Adjusted R-squared: 0.04847
## F-statistic: 51.25 on 5 and 4927 DF, p-value: < 2.2e-16
## Analysis of Variance Table
##
## Model 1: income ~ sex + hard.times
## Model 2: income ~ sex + hard.times + sex * hard.times
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 4929 3.8490e+12
## 2 4927 3.8472e+12 2 1825592360 1.169 0.3108
## Analysis of Variance Table
##
## Model 1: income ~ sex
## Model 2: income ~ sex + highest.degree
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 4931 3.8704e+12
## 2 4923 3.1779e+12 8 6.9246e+11 134.09 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Call:
## lm(formula = income ~ sex + highest.degree + sex * highest.degree,
## data = renamed_nlsy)
##
## Residuals:
## Min 1Q Median 3Q Max
## -77200 -17314 -3011 12818 108989
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 53719.1 1936.7 27.738 < 2e-16 ***
## sexFemale -11243.5 2635.2 -4.267 2.02e-05 ***
## highest.degreeBachelor's degree (BA, BS) 13448.6 2257.1 5.958 2.73e-09 ***
## highest.degreeGED -16339.6 2433.7 -6.714 2.11e-11 ***
## highest.degreeHigh school diploma 12 year -6405.5 2078.5 -3.082 0.00207 **
## highest.degreeMaster's degree (MA, MS) 21463.0 3257.8 6.588 4.92e-11 ***
## highest.degreemissing -3474.9 2808.2 -1.237 0.21599
## highest.degreeNone -18082.6 2776.4 -6.513 8.10e-11 ***
## highest.degreePhD 39530.9 12846.3 3.077 0.00210 **
## highest.degreeProfessional degree (DDS, JD, MD) 44270.4 7899.2 5.604 2.20e-08 ***
## sexFemale:highest.degreeBachelor's degree (BA, BS) -4474.6 3058.2 -1.463 0.14349
## sexFemale:highest.degreeGED 148.1 3480.9 0.043 0.96607
## sexFemale:highest.degreeHigh school diploma 12 year -5058.7 2864.5 -1.766 0.07745 .
## sexFemale:highest.degreeMaster's degree (MA, MS) -2910.0 4192.8 -0.694 0.48768
## sexFemale:highest.degreemissing 1720.1 3948.5 0.436 0.66313
## sexFemale:highest.degreeNone -6828.7 3992.7 -1.710 0.08727 .
## sexFemale:highest.degreePhD 493.5 15775.3 0.031 0.97505
## sexFemale:highest.degreeProfessional degree (DDS, JD, MD) -6046.1 10175.4 -0.594 0.55242
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 25400 on 4915 degrees of freedom
## Multiple R-squared: 0.2166, Adjusted R-squared: 0.2139
## F-statistic: 79.93 on 17 and 4915 DF, p-value: < 2.2e-16
## Analysis of Variance Table
##
## Model 1: income ~ sex + highest.degree
## Model 2: income ~ sex + highest.degree + sex * highest.degree
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 4923 3.1779e+12
## 2 4915 3.1707e+12 8 7221855249 1.3993 0.1912
## Analysis of Variance Table
##
## Model 1: income ~ sex
## Model 2: income ~ sex + job.type
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 4931 3.8704e+12
## 2 4899 3.0712e+12 32 7.9921e+11 39.839 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Call:
## lm(formula = income ~ sex + job.type + sex * job.type, data = renamed_nlsy)
##
## Residuals:
## Min 1Q Median 3Q Max
## -66273 -16348 -3289 13189 110201
##
## Coefficients: (2 not defined because of singularities)
## Estimate
## (Intercept) 34896.3
## sexFemale -1229.6
## job.typeCLEANING AND BUILDING SERVICE -2675.6
## job.typeCONSTRUCTION TRADES AND EXTRACTION WORKERS 13718.4
## job.typeCOUNSELORS, SOCIAL, AND RELIGIOUS WORKERS 9241.9
## job.typeEDUCATION, TRAINING, AND LIBRARY WORKERS 14831.0
## job.typeENGINEERING AND RELATED TECHNICIANS 21633.1
## job.typeENGINEERS, ARCHITECTS, AND SURVEYORS 50945.8
## job.typeENTERTAINERS AND PERFORMERS, SPORTS AND RELATED WORKERS 10416.2
## job.typeENTERTAINMENT ATTENDANTS AND RELATED WORKERS -2896.3
## job.typeEXECUTIVE, ADMINISTRATIVE AND MANAGERIAL 33262.2
## job.typeFARMING, FISHING, AND FORESTRY 23119.1
## job.typeFOOD PREPARATION -11696.3
## job.typeFOOD PREPARATIONS AND SERVING RELATED -5484.0
## job.typeHEALTH CARE TECHNICAL AND SUPPORT 6318.0
## job.typeHEALTH DIAGNOSIS AND TREATING PRACTITIONERS 43787.7
## job.typeINSTALLATION, MAINTENANCE, AND REPAIR WORKERS 17536.8
## job.typeLAWYERS, JUDGES, AND LEGAL SUPPORT WORKERS 40385.1
## job.typeLIFE, PHYSICAL, AND SOCIAL SCIENCE TECHNICIANS 26853.7
## job.typeMANAGEMENT RELATED 34201.1
## job.typeMATHEMATICAL AND COMPUTER SCIENTISTS 33877.0
## job.typeMEDIA AND COMMUNICATION WORKERS 30389.4
## job.typeMILITARY SPECIFIC OCCUPATIONS 38334.5
## job.typemissing 9675.6
## job.typeOFFICE AND ADMINISTRATIVE SUPPORT WORKERS 8030.6
## job.typePERSONAL CARE AND SERVICE WORKERS -396.3
## job.typePHYSICAL SCIENTISTS 9025.9
## job.typePRODUCTION AND OPERATING WORKERS 8299.3
## job.typePROTECTIVE SERVICE 29191.9
## job.typeSALES AND RELATED WORKERS 18137.8
## job.typeSETTER, OPERATORS, AND TENDERS 9097.1
## job.typeSOCIAL SCIENTISTS AND RELATED WORKERS 13103.7
## job.typeTEACHERS 12959.6
## job.typeTRANSPORTATION AND MATERIAL MOVING WORKERS 4087.4
## sexFemale:job.typeCLEANING AND BUILDING SERVICE -15132.6
## sexFemale:job.typeCONSTRUCTION TRADES AND EXTRACTION WORKERS -23135.0
## sexFemale:job.typeCOUNSELORS, SOCIAL, AND RELIGIOUS WORKERS 1022.8
## sexFemale:job.typeEDUCATION, TRAINING, AND LIBRARY WORKERS -26249.5
## sexFemale:job.typeENGINEERING AND RELATED TECHNICIANS -8299.8
## sexFemale:job.typeENGINEERS, ARCHITECTS, AND SURVEYORS 9530.4
## sexFemale:job.typeENTERTAINERS AND PERFORMERS, SPORTS AND RELATED WORKERS -1997.6
## sexFemale:job.typeENTERTAINMENT ATTENDANTS AND RELATED WORKERS -859.4
## sexFemale:job.typeEXECUTIVE, ADMINISTRATIVE AND MANAGERIAL -10118.3
## sexFemale:job.typeFARMING, FISHING, AND FORESTRY -32357.2
## sexFemale:job.typeFOOD PREPARATION -4220.4
## sexFemale:job.typeFOOD PREPARATIONS AND SERVING RELATED -4942.1
## sexFemale:job.typeHEALTH CARE TECHNICAL AND SUPPORT -10346.6
## sexFemale:job.typeHEALTH DIAGNOSIS AND TREATING PRACTITIONERS -15535.4
## sexFemale:job.typeINSTALLATION, MAINTENANCE, AND REPAIR WORKERS -13036.8
## sexFemale:job.typeLAWYERS, JUDGES, AND LEGAL SUPPORT WORKERS -15927.1
## sexFemale:job.typeLIFE, PHYSICAL, AND SOCIAL SCIENCE TECHNICIANS NA
## sexFemale:job.typeMANAGEMENT RELATED -11841.4
## sexFemale:job.typeMATHEMATICAL AND COMPUTER SCIENTISTS -3905.7
## sexFemale:job.typeMEDIA AND COMMUNICATION WORKERS -25960.7
## sexFemale:job.typeMILITARY SPECIFIC OCCUPATIONS NA
## sexFemale:job.typemissing -9008.7
## sexFemale:job.typeOFFICE AND ADMINISTRATIVE SUPPORT WORKERS -7908.3
## sexFemale:job.typePERSONAL CARE AND SERVICE WORKERS -9921.7
## sexFemale:job.typePHYSICAL SCIENTISTS 19140.7
## sexFemale:job.typePRODUCTION AND OPERATING WORKERS -11278.4
## sexFemale:job.typePROTECTIVE SERVICE -23290.4
## sexFemale:job.typeSALES AND RELATED WORKERS -17005.9
## sexFemale:job.typeSETTER, OPERATORS, AND TENDERS -12584.6
## sexFemale:job.typeSOCIAL SCIENTISTS AND RELATED WORKERS 3377.9
## sexFemale:job.typeTEACHERS -5202.9
## sexFemale:job.typeTRANSPORTATION AND MATERIAL MOVING WORKERS -11405.8
## Std. Error
## (Intercept) 9434.5
## sexFemale 17224.9
## job.typeCLEANING AND BUILDING SERVICE 9864.9
## job.typeCONSTRUCTION TRADES AND EXTRACTION WORKERS 9575.7
## job.typeCOUNSELORS, SOCIAL, AND RELIGIOUS WORKERS 10360.2
## job.typeEDUCATION, TRAINING, AND LIBRARY WORKERS 12068.6
## job.typeENGINEERING AND RELATED TECHNICIANS 11209.8
## job.typeENGINEERS, ARCHITECTS, AND SURVEYORS 10266.7
## job.typeENTERTAINERS AND PERFORMERS, SPORTS AND RELATED WORKERS 10548.1
## job.typeENTERTAINMENT ATTENDANTS AND RELATED WORKERS 12918.7
## job.typeEXECUTIVE, ADMINISTRATIVE AND MANAGERIAL 9558.7
## job.typeFARMING, FISHING, AND FORESTRY 11702.0
## job.typeFOOD PREPARATION 12301.0
## job.typeFOOD PREPARATIONS AND SERVING RELATED 9710.3
## job.typeHEALTH CARE TECHNICAL AND SUPPORT 10190.4
## job.typeHEALTH DIAGNOSIS AND TREATING PRACTITIONERS 10774.9
## job.typeINSTALLATION, MAINTENANCE, AND REPAIR WORKERS 9620.3
## job.typeLAWYERS, JUDGES, AND LEGAL SUPPORT WORKERS 11036.4
## job.typeLIFE, PHYSICAL, AND SOCIAL SCIENCE TECHNICIANS 15645.3
## job.typeMANAGEMENT RELATED 9730.0
## job.typeMATHEMATICAL AND COMPUTER SCIENTISTS 9727.4
## job.typeMEDIA AND COMMUNICATION WORKERS 10894.0
## job.typeMILITARY SPECIFIC OCCUPATIONS 11702.0
## job.typemissing 9838.6
## job.typeOFFICE AND ADMINISTRATIVE SUPPORT WORKERS 9599.8
## job.typePERSONAL CARE AND SERVICE WORKERS 10334.9
## job.typePHYSICAL SCIENTISTS 12579.3
## job.typePRODUCTION AND OPERATING WORKERS 10141.7
## job.typePROTECTIVE SERVICE 9755.9
## job.typeSALES AND RELATED WORKERS 9599.8
## job.typeSETTER, OPERATORS, AND TENDERS 9685.1
## job.typeSOCIAL SCIENTISTS AND RELATED WORKERS 12579.3
## job.typeTEACHERS 9978.5
## job.typeTRANSPORTATION AND MATERIAL MOVING WORKERS 9555.1
## sexFemale:job.typeCLEANING AND BUILDING SERVICE 17839.9
## sexFemale:job.typeCONSTRUCTION TRADES AND EXTRACTION WORKERS 19018.1
## sexFemale:job.typeCOUNSELORS, SOCIAL, AND RELIGIOUS WORKERS 17987.7
## sexFemale:job.typeEDUCATION, TRAINING, AND LIBRARY WORKERS 19380.2
## sexFemale:job.typeENGINEERING AND RELATED TECHNICIANS 25394.5
## sexFemale:job.typeENGINEERS, ARCHITECTS, AND SURVEYORS 20052.5
## sexFemale:job.typeENTERTAINERS AND PERFORMERS, SPORTS AND RELATED WORKERS 18365.0
## sexFemale:job.typeENTERTAINMENT ATTENDANTS AND RELATED WORKERS 20228.6
## sexFemale:job.typeEXECUTIVE, ADMINISTRATIVE AND MANAGERIAL 17366.1
## sexFemale:job.typeFARMING, FISHING, AND FORESTRY 20823.9
## sexFemale:job.typeFOOD PREPARATION 22688.5
## sexFemale:job.typeFOOD PREPARATIONS AND SERVING RELATED 17512.8
## sexFemale:job.typeHEALTH CARE TECHNICAL AND SUPPORT 17742.4
## sexFemale:job.typeHEALTH DIAGNOSIS AND TREATING PRACTITIONERS 18115.6
## sexFemale:job.typeINSTALLATION, MAINTENANCE, AND REPAIR WORKERS 20101.8
## sexFemale:job.typeLAWYERS, JUDGES, AND LEGAL SUPPORT WORKERS 18883.3
## sexFemale:job.typeLIFE, PHYSICAL, AND SOCIAL SCIENCE TECHNICIANS NA
## sexFemale:job.typeMANAGEMENT RELATED 17508.4
## sexFemale:job.typeMATHEMATICAL AND COMPUTER SCIENTISTS 17994.3
## sexFemale:job.typeMEDIA AND COMMUNICATION WORKERS 18580.8
## sexFemale:job.typeMILITARY SPECIFIC OCCUPATIONS NA
## sexFemale:job.typemissing 17841.8
## sexFemale:job.typeOFFICE AND ADMINISTRATIVE SUPPORT WORKERS 17353.3
## sexFemale:job.typePERSONAL CARE AND SERVICE WORKERS 17876.4
## sexFemale:job.typePHYSICAL SCIENTISTS 20441.3
## sexFemale:job.typePRODUCTION AND OPERATING WORKERS 18315.7
## sexFemale:job.typePROTECTIVE SERVICE 17867.9
## sexFemale:job.typeSALES AND RELATED WORKERS 17402.3
## sexFemale:job.typeSETTER, OPERATORS, AND TENDERS 17733.3
## sexFemale:job.typeSOCIAL SCIENTISTS AND RELATED WORKERS 20441.3
## sexFemale:job.typeTEACHERS 17618.8
## sexFemale:job.typeTRANSPORTATION AND MATERIAL MOVING WORKERS 17662.6
## t value
## (Intercept) 3.699
## sexFemale -0.071
## job.typeCLEANING AND BUILDING SERVICE -0.271
## job.typeCONSTRUCTION TRADES AND EXTRACTION WORKERS 1.433
## job.typeCOUNSELORS, SOCIAL, AND RELIGIOUS WORKERS 0.892
## job.typeEDUCATION, TRAINING, AND LIBRARY WORKERS 1.229
## job.typeENGINEERING AND RELATED TECHNICIANS 1.930
## job.typeENGINEERS, ARCHITECTS, AND SURVEYORS 4.962
## job.typeENTERTAINERS AND PERFORMERS, SPORTS AND RELATED WORKERS 0.988
## job.typeENTERTAINMENT ATTENDANTS AND RELATED WORKERS -0.224
## job.typeEXECUTIVE, ADMINISTRATIVE AND MANAGERIAL 3.480
## job.typeFARMING, FISHING, AND FORESTRY 1.976
## job.typeFOOD PREPARATION -0.951
## job.typeFOOD PREPARATIONS AND SERVING RELATED -0.565
## job.typeHEALTH CARE TECHNICAL AND SUPPORT 0.620
## job.typeHEALTH DIAGNOSIS AND TREATING PRACTITIONERS 4.064
## job.typeINSTALLATION, MAINTENANCE, AND REPAIR WORKERS 1.823
## job.typeLAWYERS, JUDGES, AND LEGAL SUPPORT WORKERS 3.659
## job.typeLIFE, PHYSICAL, AND SOCIAL SCIENCE TECHNICIANS 1.716
## job.typeMANAGEMENT RELATED 3.515
## job.typeMATHEMATICAL AND COMPUTER SCIENTISTS 3.483
## job.typeMEDIA AND COMMUNICATION WORKERS 2.790
## job.typeMILITARY SPECIFIC OCCUPATIONS 3.276
## job.typemissing 0.983
## job.typeOFFICE AND ADMINISTRATIVE SUPPORT WORKERS 0.837
## job.typePERSONAL CARE AND SERVICE WORKERS -0.038
## job.typePHYSICAL SCIENTISTS 0.718
## job.typePRODUCTION AND OPERATING WORKERS 0.818
## job.typePROTECTIVE SERVICE 2.992
## job.typeSALES AND RELATED WORKERS 1.889
## job.typeSETTER, OPERATORS, AND TENDERS 0.939
## job.typeSOCIAL SCIENTISTS AND RELATED WORKERS 1.042
## job.typeTEACHERS 1.299
## job.typeTRANSPORTATION AND MATERIAL MOVING WORKERS 0.428
## sexFemale:job.typeCLEANING AND BUILDING SERVICE -0.848
## sexFemale:job.typeCONSTRUCTION TRADES AND EXTRACTION WORKERS -1.216
## sexFemale:job.typeCOUNSELORS, SOCIAL, AND RELIGIOUS WORKERS 0.057
## sexFemale:job.typeEDUCATION, TRAINING, AND LIBRARY WORKERS -1.354
## sexFemale:job.typeENGINEERING AND RELATED TECHNICIANS -0.327
## sexFemale:job.typeENGINEERS, ARCHITECTS, AND SURVEYORS 0.475
## sexFemale:job.typeENTERTAINERS AND PERFORMERS, SPORTS AND RELATED WORKERS -0.109
## sexFemale:job.typeENTERTAINMENT ATTENDANTS AND RELATED WORKERS -0.042
## sexFemale:job.typeEXECUTIVE, ADMINISTRATIVE AND MANAGERIAL -0.583
## sexFemale:job.typeFARMING, FISHING, AND FORESTRY -1.554
## sexFemale:job.typeFOOD PREPARATION -0.186
## sexFemale:job.typeFOOD PREPARATIONS AND SERVING RELATED -0.282
## sexFemale:job.typeHEALTH CARE TECHNICAL AND SUPPORT -0.583
## sexFemale:job.typeHEALTH DIAGNOSIS AND TREATING PRACTITIONERS -0.858
## sexFemale:job.typeINSTALLATION, MAINTENANCE, AND REPAIR WORKERS -0.649
## sexFemale:job.typeLAWYERS, JUDGES, AND LEGAL SUPPORT WORKERS -0.843
## sexFemale:job.typeLIFE, PHYSICAL, AND SOCIAL SCIENCE TECHNICIANS NA
## sexFemale:job.typeMANAGEMENT RELATED -0.676
## sexFemale:job.typeMATHEMATICAL AND COMPUTER SCIENTISTS -0.217
## sexFemale:job.typeMEDIA AND COMMUNICATION WORKERS -1.397
## sexFemale:job.typeMILITARY SPECIFIC OCCUPATIONS NA
## sexFemale:job.typemissing -0.505
## sexFemale:job.typeOFFICE AND ADMINISTRATIVE SUPPORT WORKERS -0.456
## sexFemale:job.typePERSONAL CARE AND SERVICE WORKERS -0.555
## sexFemale:job.typePHYSICAL SCIENTISTS 0.936
## sexFemale:job.typePRODUCTION AND OPERATING WORKERS -0.616
## sexFemale:job.typePROTECTIVE SERVICE -1.303
## sexFemale:job.typeSALES AND RELATED WORKERS -0.977
## sexFemale:job.typeSETTER, OPERATORS, AND TENDERS -0.710
## sexFemale:job.typeSOCIAL SCIENTISTS AND RELATED WORKERS 0.165
## sexFemale:job.typeTEACHERS -0.295
## sexFemale:job.typeTRANSPORTATION AND MATERIAL MOVING WORKERS -0.646
## Pr(>|t|)
## (Intercept) 0.000219
## sexFemale 0.943093
## job.typeCLEANING AND BUILDING SERVICE 0.786229
## job.typeCONSTRUCTION TRADES AND EXTRACTION WORKERS 0.152032
## job.typeCOUNSELORS, SOCIAL, AND RELIGIOUS WORKERS 0.372405
## job.typeEDUCATION, TRAINING, AND LIBRARY WORKERS 0.219173
## job.typeENGINEERING AND RELATED TECHNICIANS 0.053685
## job.typeENGINEERS, ARCHITECTS, AND SURVEYORS 7.21e-07
## job.typeENTERTAINERS AND PERFORMERS, SPORTS AND RELATED WORKERS 0.323446
## job.typeENTERTAINMENT ATTENDANTS AND RELATED WORKERS 0.822616
## job.typeEXECUTIVE, ADMINISTRATIVE AND MANAGERIAL 0.000506
## job.typeFARMING, FISHING, AND FORESTRY 0.048251
## job.typeFOOD PREPARATION 0.341734
## job.typeFOOD PREPARATIONS AND SERVING RELATED 0.572261
## job.typeHEALTH CARE TECHNICAL AND SUPPORT 0.535290
## job.typeHEALTH DIAGNOSIS AND TREATING PRACTITIONERS 4.90e-05
## job.typeINSTALLATION, MAINTENANCE, AND REPAIR WORKERS 0.068379
## job.typeLAWYERS, JUDGES, AND LEGAL SUPPORT WORKERS 0.000256
## job.typeLIFE, PHYSICAL, AND SOCIAL SCIENCE TECHNICIANS 0.086151
## job.typeMANAGEMENT RELATED 0.000444
## job.typeMATHEMATICAL AND COMPUTER SCIENTISTS 0.000501
## job.typeMEDIA AND COMMUNICATION WORKERS 0.005298
## job.typeMILITARY SPECIFIC OCCUPATIONS 0.001061
## job.typemissing 0.325443
## job.typeOFFICE AND ADMINISTRATIVE SUPPORT WORKERS 0.402895
## job.typePERSONAL CARE AND SERVICE WORKERS 0.969415
## job.typePHYSICAL SCIENTISTS 0.473086
## job.typePRODUCTION AND OPERATING WORKERS 0.413210
## job.typePROTECTIVE SERVICE 0.002783
## job.typeSALES AND RELATED WORKERS 0.058898
## job.typeSETTER, OPERATORS, AND TENDERS 0.347633
## job.typeSOCIAL SCIENTISTS AND RELATED WORKERS 0.297607
## job.typeTEACHERS 0.194087
## job.typeTRANSPORTATION AND MATERIAL MOVING WORKERS 0.668834
## sexFemale:job.typeCLEANING AND BUILDING SERVICE 0.396344
## sexFemale:job.typeCONSTRUCTION TRADES AND EXTRACTION WORKERS 0.223864
## sexFemale:job.typeCOUNSELORS, SOCIAL, AND RELIGIOUS WORKERS 0.954656
## sexFemale:job.typeEDUCATION, TRAINING, AND LIBRARY WORKERS 0.175655
## sexFemale:job.typeENGINEERING AND RELATED TECHNICIANS 0.743807
## sexFemale:job.typeENGINEERS, ARCHITECTS, AND SURVEYORS 0.634615
## sexFemale:job.typeENTERTAINERS AND PERFORMERS, SPORTS AND RELATED WORKERS 0.913389
## sexFemale:job.typeENTERTAINMENT ATTENDANTS AND RELATED WORKERS 0.966115
## sexFemale:job.typeEXECUTIVE, ADMINISTRATIVE AND MANAGERIAL 0.560158
## sexFemale:job.typeFARMING, FISHING, AND FORESTRY 0.120285
## sexFemale:job.typeFOOD PREPARATION 0.852442
## sexFemale:job.typeFOOD PREPARATIONS AND SERVING RELATED 0.777802
## sexFemale:job.typeHEALTH CARE TECHNICAL AND SUPPORT 0.559817
## sexFemale:job.typeHEALTH DIAGNOSIS AND TREATING PRACTITIONERS 0.391172
## sexFemale:job.typeINSTALLATION, MAINTENANCE, AND REPAIR WORKERS 0.516666
## sexFemale:job.typeLAWYERS, JUDGES, AND LEGAL SUPPORT WORKERS 0.399018
## sexFemale:job.typeLIFE, PHYSICAL, AND SOCIAL SCIENCE TECHNICIANS NA
## sexFemale:job.typeMANAGEMENT RELATED 0.498865
## sexFemale:job.typeMATHEMATICAL AND COMPUTER SCIENTISTS 0.828177
## sexFemale:job.typeMEDIA AND COMMUNICATION WORKERS 0.162424
## sexFemale:job.typeMILITARY SPECIFIC OCCUPATIONS NA
## sexFemale:job.typemissing 0.613640
## sexFemale:job.typeOFFICE AND ADMINISTRATIVE SUPPORT WORKERS 0.648607
## sexFemale:job.typePERSONAL CARE AND SERVICE WORKERS 0.578906
## sexFemale:job.typePHYSICAL SCIENTISTS 0.349128
## sexFemale:job.typePRODUCTION AND OPERATING WORKERS 0.538071
## sexFemale:job.typePROTECTIVE SERVICE 0.192474
## sexFemale:job.typeSALES AND RELATED WORKERS 0.328508
## sexFemale:job.typeSETTER, OPERATORS, AND TENDERS 0.477951
## sexFemale:job.typeSOCIAL SCIENTISTS AND RELATED WORKERS 0.868757
## sexFemale:job.typeTEACHERS 0.767775
## sexFemale:job.typeTRANSPORTATION AND MATERIAL MOVING WORKERS 0.518468
##
## (Intercept) ***
## sexFemale
## job.typeCLEANING AND BUILDING SERVICE
## job.typeCONSTRUCTION TRADES AND EXTRACTION WORKERS
## job.typeCOUNSELORS, SOCIAL, AND RELIGIOUS WORKERS
## job.typeEDUCATION, TRAINING, AND LIBRARY WORKERS
## job.typeENGINEERING AND RELATED TECHNICIANS .
## job.typeENGINEERS, ARCHITECTS, AND SURVEYORS ***
## job.typeENTERTAINERS AND PERFORMERS, SPORTS AND RELATED WORKERS
## job.typeENTERTAINMENT ATTENDANTS AND RELATED WORKERS
## job.typeEXECUTIVE, ADMINISTRATIVE AND MANAGERIAL ***
## job.typeFARMING, FISHING, AND FORESTRY *
## job.typeFOOD PREPARATION
## job.typeFOOD PREPARATIONS AND SERVING RELATED
## job.typeHEALTH CARE TECHNICAL AND SUPPORT
## job.typeHEALTH DIAGNOSIS AND TREATING PRACTITIONERS ***
## job.typeINSTALLATION, MAINTENANCE, AND REPAIR WORKERS .
## job.typeLAWYERS, JUDGES, AND LEGAL SUPPORT WORKERS ***
## job.typeLIFE, PHYSICAL, AND SOCIAL SCIENCE TECHNICIANS .
## job.typeMANAGEMENT RELATED ***
## job.typeMATHEMATICAL AND COMPUTER SCIENTISTS ***
## job.typeMEDIA AND COMMUNICATION WORKERS **
## job.typeMILITARY SPECIFIC OCCUPATIONS **
## job.typemissing
## job.typeOFFICE AND ADMINISTRATIVE SUPPORT WORKERS
## job.typePERSONAL CARE AND SERVICE WORKERS
## job.typePHYSICAL SCIENTISTS
## job.typePRODUCTION AND OPERATING WORKERS
## job.typePROTECTIVE SERVICE **
## job.typeSALES AND RELATED WORKERS .
## job.typeSETTER, OPERATORS, AND TENDERS
## job.typeSOCIAL SCIENTISTS AND RELATED WORKERS
## job.typeTEACHERS
## job.typeTRANSPORTATION AND MATERIAL MOVING WORKERS
## sexFemale:job.typeCLEANING AND BUILDING SERVICE
## sexFemale:job.typeCONSTRUCTION TRADES AND EXTRACTION WORKERS
## sexFemale:job.typeCOUNSELORS, SOCIAL, AND RELIGIOUS WORKERS
## sexFemale:job.typeEDUCATION, TRAINING, AND LIBRARY WORKERS
## sexFemale:job.typeENGINEERING AND RELATED TECHNICIANS
## sexFemale:job.typeENGINEERS, ARCHITECTS, AND SURVEYORS
## sexFemale:job.typeENTERTAINERS AND PERFORMERS, SPORTS AND RELATED WORKERS
## sexFemale:job.typeENTERTAINMENT ATTENDANTS AND RELATED WORKERS
## sexFemale:job.typeEXECUTIVE, ADMINISTRATIVE AND MANAGERIAL
## sexFemale:job.typeFARMING, FISHING, AND FORESTRY
## sexFemale:job.typeFOOD PREPARATION
## sexFemale:job.typeFOOD PREPARATIONS AND SERVING RELATED
## sexFemale:job.typeHEALTH CARE TECHNICAL AND SUPPORT
## sexFemale:job.typeHEALTH DIAGNOSIS AND TREATING PRACTITIONERS
## sexFemale:job.typeINSTALLATION, MAINTENANCE, AND REPAIR WORKERS
## sexFemale:job.typeLAWYERS, JUDGES, AND LEGAL SUPPORT WORKERS
## sexFemale:job.typeLIFE, PHYSICAL, AND SOCIAL SCIENCE TECHNICIANS
## sexFemale:job.typeMANAGEMENT RELATED
## sexFemale:job.typeMATHEMATICAL AND COMPUTER SCIENTISTS
## sexFemale:job.typeMEDIA AND COMMUNICATION WORKERS
## sexFemale:job.typeMILITARY SPECIFIC OCCUPATIONS
## sexFemale:job.typemissing
## sexFemale:job.typeOFFICE AND ADMINISTRATIVE SUPPORT WORKERS
## sexFemale:job.typePERSONAL CARE AND SERVICE WORKERS
## sexFemale:job.typePHYSICAL SCIENTISTS
## sexFemale:job.typePRODUCTION AND OPERATING WORKERS
## sexFemale:job.typePROTECTIVE SERVICE
## sexFemale:job.typeSALES AND RELATED WORKERS
## sexFemale:job.typeSETTER, OPERATORS, AND TENDERS
## sexFemale:job.typeSOCIAL SCIENTISTS AND RELATED WORKERS
## sexFemale:job.typeTEACHERS
## sexFemale:job.typeTRANSPORTATION AND MATERIAL MOVING WORKERS
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 24960 on 4869 degrees of freedom
## Multiple R-squared: 0.2504, Adjusted R-squared: 0.2407
## F-statistic: 25.82 on 63 and 4869 DF, p-value: < 2.2e-16
## Analysis of Variance Table
##
## Model 1: income ~ sex + job.type
## Model 2: income ~ sex + job.type + sex * job.type
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 4899 3.0712e+12
## 2 4869 3.0337e+12 30 3.7488e+10 2.0056 0.0009284 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.0000 0.0000 0.1391 0.0000 11.0000
##
## Call:
## lm(formula = income ~ sex, data = renamed_nlsy)
##
## Residuals:
## Min 1Q Median 3Q Max
## -51135 -20137 -4659 13863 105841
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 51136.7 559.2 91.44 <2e-16 ***
## sexFemale -11977.6 797.9 -15.01 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 28020 on 4931 degrees of freedom
## Multiple R-squared: 0.0437, Adjusted R-squared: 0.04351
## F-statistic: 225.3 on 1 and 4931 DF, p-value: < 2.2e-16
##
## Call:
## lm(formula = income ~ sex + total.incarcerations, data = renamed_nlsy)
##
## Residuals:
## Min 1Q Median 3Q Max
## -52705 -19484 -4855 14516 105516
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 52855.3 575.4 91.85 <2e-16 ***
## sexFemale -13370.9 799.3 -16.73 <2e-16 ***
## total.incarcerations -7437.5 691.2 -10.76 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 27700 on 4930 degrees of freedom
## Multiple R-squared: 0.06564, Adjusted R-squared: 0.06526
## F-statistic: 173.2 on 2 and 4930 DF, p-value: < 2.2e-16
## Analysis of Variance Table
##
## Model 1: income ~ sex
## Model 2: income ~ sex + total.incarcerations
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 4931 3.8704e+12
## 2 4930 3.7816e+12 1 8.8801e+10 115.77 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Call:
## lm(formula = income ~ sex + total.incarcerations + sex * total.incarcerations,
## data = renamed_nlsy)
##
## Residuals:
## Min 1Q Median 3Q Max
## -52678 -19517 -4517 14483 105483
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 52828.4 579.0 91.245 <2e-16 ***
## sexFemale -13311.8 811.5 -16.404 <2e-16 ***
## total.incarcerations -7321.0 744.0 -9.840 <2e-16 ***
## sexFemale:total.incarcerations -852.1 2012.7 -0.423 0.672
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 27700 on 4929 degrees of freedom
## Multiple R-squared: 0.06568, Adjusted R-squared: 0.06511
## F-statistic: 115.5 on 3 and 4929 DF, p-value: < 2.2e-16
## Analysis of Variance Table
##
## Model 1: income ~ sex + total.incarcerations
## Model 2: income ~ sex + total.incarcerations + sex * total.incarcerations
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 4930 3.7816e+12
## 2 4929 3.7815e+12 1 137502914 0.1792 0.6721
Interaction Terms
a) Checking for statistically significant interactions with the variable sex
We tested the interaction of sex with every variable shortlisted for the final lm. Only 3 interaction terms were statistically significant: 1) sex*race 2) sex*marital.status 3) sex*highest.degree
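The screening step described above can be sketched as a loop that fits an additive model and an interaction model for each candidate variable and compares the two with a nested F-test via anova(). This is a minimal sketch on simulated toy data, not the NLSY97 values; the data frame `toy`, the vector `candidates`, and the built-in effect sizes are assumptions made only for illustration.

```r
# Toy data only: a simulated stand-in for renamed_nlsy (not the NLSY97 values).
set.seed(1)
n <- 1000
toy <- data.frame(
  sex = factor(sample(c("Male", "Female"), n, replace = TRUE),
               levels = c("Male", "Female")),
  race = factor(sample(c("Black", "Hispanic", "Other"), n, replace = TRUE)),
  marital.status = factor(sample(c("Divorced", "Married", "Never married"),
                                 n, replace = TRUE))
)
# Build in a sex-by-race interaction so that there is something to detect.
toy$income <- 50000 -
  12000 * (toy$sex == "Female") -
  10000 * (toy$sex == "Female" & toy$race == "Hispanic") +
  rnorm(n, sd = 5000)

# For each candidate, compare the additive model with the interaction model.
candidates <- c("race", "marital.status")
interaction_p <- sapply(candidates, function(v) {
  additive   <- lm(reformulate(c("sex", v), response = "income"), data = toy)
  interacted <- lm(reformulate(c("sex", v, paste0("sex:", v)),
                               response = "income"), data = toy)
  # Row 2 of the anova table holds the F-test for the added interaction terms.
  anova(additive, interacted)[2, "Pr(>F)"]
})
print(interaction_p)
```

A small p-value for a candidate suggests that the sex effect differs across the levels of that variable. Testing many candidates this way inflates the chance of a false positive, so the p-values are best read as a screening aid rather than as final evidence.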
The statistically significant interactions with the variable sex are discussed below, with the purpose of deciding whether each should be included in the final lm:
sex*race: Sex and race appear to have a compounding effect on income. Sex has a statistically significant impact on income, reflected in the income gap between men and women. This gap may vary across racial groups, owing to differences in the social constraints placed on each sex within those groups. For example, on average, brown women may be affected considerably more by domestic and household pressures than white women: if their social set-up expects them to focus on domestic chores and child-rearing while giving brown men greater freedom to pursue professional careers, the result would be a wider income gap. Therefore, it makes sense to include the statistically significant sex and race interaction term in the linear model to check for its impact on the income gap.
sex*highest.degree: Highest degree is an important variable defining education in our linear regression model, and it can have a compounded effect with sex on income. For example, a sex that is already disadvantaged in terms of income may be further disadvantaged at lower education levels. For similar qualifications, women tend to earn less than men in the US, which points to a compounded effect of sex and education on income. One study indicates that the income gap widens with a college degree, that is, the income gap between men and women with a college degree is wider than the gap between men and women without one, which is further evidence for an interaction term in our regression model (Day, 2019). On these grounds it would make sense to include the statistically significant sex and highest.degree interaction term in the linear model. However, although the term is interesting, we have chosen to leave it out to keep the model simple.
Sources: Day, J. C. (2019, May 29). College Degree Widens Gender Earnings Gap - U.S. Census Bureau. Retrieved November 13, 2020, from https://www.census.gov/library/stories/2019/05/college-degree-widens-gender-earnings-gap.html
## [1] "total.incarcerations" "marijuana" "sex"
## [4] "marital.status" "high.school.diploma" "parenthood.by.20"
## [7] "chance.college.degree" "hard.times" "work.limitation"
## [10] "citizenship" "mother.edu" "father.edu"
## [13] "race" "hard.drugs" "public.private"
## [16] "highest.degree" "emp_industry_2011" "job.type"
## [19] "income"
##
## Call:
## lm(formula = income ~ sex + marital.status, data = renamed_nlsy)
##
## Residuals:
## Min 1Q Median 3Q Max
## -56702 -19491 -4914 14555 107624
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 47478 1300 36.513 < 2e-16 ***
## sexFemale -11988 784 -15.290 < 2e-16 ***
## marital.statusMarried 9423 1346 7.003 2.84e-12 ***
## marital.statusmissing -4827 5162 -0.935 0.350
## marital.statusNever married -2034 1377 -1.477 0.140
## marital.statusSeparated -3115 2957 -1.053 0.292
## marital.statusWidowed -5900 6774 -0.871 0.384
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 27460 on 4926 degrees of freedom
## Multiple R-squared: 0.08219, Adjusted R-squared: 0.08108
## F-statistic: 73.52 on 6 and 4926 DF, p-value: < 2.2e-16
The coefficient of sexFemale indicates that females earn $12,676 less than males, who are the baseline in our dataset.
The coefficient for raceHispanic indicates that Hispanic individuals earn on average $7,154 more than Black individuals.
The coefficient for raceNon-Black/Non-Hispanic indicates that individuals in this group earn $10,422 more than Black individuals.
The coefficient for total.incarcerations indicates that income falls by $3,212 for each additional incarceration.
The coefficient for marital.statusMarried shows that individuals who are married earn $3,252 more than individuals who are divorced.
The coefficient for marital.statusNever_married shows that individuals who have never married earn $3,156 less than individuals who are divorced.
The factor level UTILITIES of emp_industry_2011 (which describes the employment industry) has a statistically significant coefficient. On average, income is $20,796 higher than for the baseline level, ACS SPECIAL CODE.
The factor level MINING of emp_industry_2011 also has a statistically significant coefficient. On average, income is $25,441 higher than for the baseline level, ACS SPECIAL CODE.
The factor level Professional Degree of highest degree (which describes the highest degree a person has attained) has a statistically significant coefficient. On average, income is $37,439 higher than for the baseline level, Associate Junior College.
The highest degree level PhD is also highly significant, with a p-value of 4.29e-08. Income is on average $39,296 higher than for the baseline of Associate Junior College.
The highest degree level None is also highly significant, with a p-value of less than 2e-16. Income is on average $18,042 lower than for the baseline of Associate Junior College.
The highest degree level Master’s degree MA/MS is also highly significant, with a p-value of less than 2e-16. Income is on average $18,418 higher than for the baseline of Associate Junior College.
The highest degree level High school diploma is also highly significant, with a p-value of 8.61e-09. Income is on average $7,967 lower than for the baseline of Associate Junior College.
The highest degree level GED is also highly significant, with a p-value of 8.01e-15. Income is on average $13,179 lower than for the baseline of Associate Junior College.
The highest degree level Bachelor’s degree is also highly significant, with a p-value of 1.69e-10. Income is on average $9,436 higher than for the baseline of Associate Junior College.
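Each coefficient above is read relative to a baseline level. By default R takes the first factor level as the baseline, which for a factor built from character labels is the first label in sorted order; this is presumably why Associate Junior College and ACS SPECIAL CODE serve as baselines here. The following sketch shows how to inspect and change the baseline. The labels are hypothetical stand-ins, and the exact level names in the codebook may differ.

```r
# Hypothetical labels for illustration; the codebook's exact names may differ.
degree <- factor(c("None", "GED", "High school diploma",
                   "Associate/Junior college", "Bachelor's degree"))

# The first level is the baseline that lm() uses with default treatment contrasts.
levels(degree)[1]

# relevel() moves a chosen level to the front, making it the new baseline.
degree_hs <- relevel(degree, ref = "High school diploma")
levels(degree_hs)[1]
```

Choosing a common, easily interpreted level such as high school diploma as the baseline can make the coefficients easier to read, but it does not change the fitted values of the model.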
The sex and race interaction term gives an interesting perspective on the income gap between men and women. The sexFemale:raceHispanic and sexFemale:raceNon-Black/Non-Hispanic terms have coefficients of -7575.78 and -8128.30, which are statistically significant with p-values of 0.00025 and 1.88e-06, respectively. The sexFemale:raceHispanic coefficient suggests that the difference in income between women and men (female minus male) is on average $7,575.78 lower for Hispanic individuals than for our baseline race, Black. The sexFemale:raceNon-Black/Non-Hispanic coefficient suggests that this difference is on average $8,128.30 lower for Non-Black / Non-Hispanic individuals than for Black individuals. Because the sexFemale coefficient is itself negative, both results mean that the income gap is wider in these groups than in the baseline group.
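To see how the interaction coefficients combine with the main effect of sex, the female-minus-male difference within each race can be computed by adding the relevant interaction term to the sexFemale coefficient. The sketch below uses the estimates quoted in the text; it assumes that the sexFemale estimate of -12,676 comes from the same model as the two interaction estimates, which the output shown here does not confirm.

```r
# Estimates quoted in the text.
# ASSUMPTION: the sexFemale estimate (-12,676) comes from the same model
# as the two interaction estimates.
b_sex_female   <- -12676
b_int_hispanic <- -7575.78
b_int_nbnh     <- -8128.30

# Female-minus-male income difference within each race, other terms held fixed.
gap_by_race <- c(
  "Black"                    = b_sex_female,
  "Hispanic"                 = b_sex_female + b_int_hispanic,
  "Non-Black / Non-Hispanic" = b_sex_female + b_int_nbnh
)
print(gap_by_race)
```

Under that assumption, the implied difference is about -$20,252 for Hispanic individuals and about -$20,804 for Non-Black / Non-Hispanic individuals, compared with -$12,676 for the baseline group.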
All other variables were found to be statistically insignificant at the 0.05 significance level.
#### a. Analyzing Diagnostic Plots
lm.final <- lm(income ~ sex + race + sex*race + highest.degree +
total.incarcerations + marital.status + hard.times +
citizenship + emp_industry_2011 + job.type, data=renamed_nlsy)
plot(lm.final)
Residuals vs Fitted: For every observation in the data, the linear regression model predicts a fitted value, which is the estimate of income based on the fitted model. The residual is the difference between the observed value of income and the fitted value. We use this plot to assess non-linearity and to check whether the constant variance assumption is violated. The red line, which tracks the average residual, stays close to 0 across the fitted values, so we see no strong evidence of non-linearity.
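The definition of a residual can be checked directly. The sketch below uses simulated toy data (the names `toy` and `fit` are illustrative, not from the analysis) and also shows why an average residual near zero is expected for any least-squares fit with an intercept.

```r
# Toy data only, simulated for illustration.
set.seed(4)
toy <- data.frame(x = rnorm(100))
toy$y <- 5 + 2 * toy$x + rnorm(100)
fit <- lm(y ~ x, data = toy)

# Residual = observed value minus fitted value.
manual_resid <- toy$y - fitted(fit)
all.equal(unname(manual_resid), unname(resid(fit)))

# With an intercept in the model, the residuals average to zero by construction,
# so a mean near zero alone does not show that the fit is good.
mean(resid(fit))
```

What the plot adds is the local picture: whether the average residual stays near zero along the whole range of fitted values, and whether the spread of the residuals changes.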
Normal Q-Q plot: The residuals follow the line well except in the upper tail, where there is a clear deviation from normality. This may be a result of most of the variables in our data being categorical, which causes our income to take only select values.
Residuals vs Leverage: A problematic outlier would be one with both a high residual and high leverage, meaning that the point lies far from what the model predicts and has a strong influence on the linear model that we fit. However, since no points lie outside the Cook’s distance contours, there are no outliers of concern.
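The visual check against the Cook’s distance contours can be backed up numerically with cooks.distance(). This is a minimal sketch on simulated toy data; the cut-off of 0.5 matches the first contour that plot() draws for lm objects by default and is a rule of thumb, not a formal test.

```r
# Toy data only, simulated for illustration.
set.seed(2)
toy <- data.frame(x = rnorm(200))
toy$y <- 1 + 2 * toy$x + rnorm(200)
fit <- lm(y ~ x, data = toy)

# One Cook's distance per observation.
cooks <- cooks.distance(fit)

# Flag observations beyond the 0.5 contour (rule-of-thumb threshold).
flagged <- which(cooks > 0.5)
length(flagged)

# The most influential observations, largest first.
head(sort(cooks, decreasing = TRUE), 3)
```

Applied to the fitted model, the same call would list any observations worth inspecting by row, which is easier to act on than reading points off the plot.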
Scale-Location: This plot provides a clearer picture of non-constant variance. Ideally, we would see a roughly flat line, meaning that the spread of the residuals, and hence the ability of the model to predict income, does not vary across fitted values. Here, the upward-sloping line tells us that there is non-constant variance in the model; in other words, the error of the model is not constant across observations. As the predicted values get higher, so does our error, which may call our inferences into question.
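The visual impression of non-constant variance can be complemented with a numerical check. The sketch below is not part of the original analysis: it implements a Breusch-Pagan style test in its studentized (Koenker) form by hand, regressing the squared residuals on the predictors and comparing n times the R-squared of that auxiliary regression with a chi-squared distribution. The data are simulated so that the error spread grows with x.

```r
# Toy data only: the error spread grows with x by construction.
set.seed(3)
n <- 2000
x <- runif(n, 1, 10)
y <- 2 + 3 * x + rnorm(n, sd = x)
fit <- lm(y ~ x)

# Auxiliary regression of the squared residuals on the predictors.
aux <- lm(resid(fit)^2 ~ x)

# Test statistic: n * R^2, chi-squared with df = number of predictors (here 1).
lm_stat <- n * summary(aux)$r.squared
p_value <- pchisq(lm_stat, df = 1, lower.tail = FALSE)
c(statistic = lm_stat, p.value = p_value)
```

A small p-value indicates that the residual variance depends on the predictors. In that case the coefficient estimates remain usable, but the reported standard errors and p-values should be treated with caution.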